diff --git a/.gitattributes b/.gitattributes
index 48e1fe032537c7401a1b7f2740bf08603da43745..a6344aac8c09253b3b630fb776ae94478aa0275b 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
-training_time.png filter=lfs diff=lfs merge=lfs -text
diff --git a/.gitignore b/.gitignore
index 971d5aceb2f69d9fc351ce6601c6ce055e4a07e3..3f992ad4c0a1f5004e4c6b529902b67a7fc3c2c3 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,5 +1,3 @@
 __pycache__/
 trainings/
 scores/
-slurm/
-
diff --git a/README.md b/README.md
index aadb3f75f2745b4aadc68b15d938b1b42dd0a66f..7be5fc7f47d5db027d120b8024982df93db95b74 100644
--- a/README.md
+++ b/README.md
@@ -1,344 +1,3 @@
 ---
 license: mit
 ---
-## Abstract
-
-We introduce a foundation model for event classification in high-energy physics, built on a **Graph Neural Network** architecture and trained on **120 million simulated proton-proton collision events** spanning 12 distinct physics processes. The model is *pretrained* to learn a general and robust representation of collision data using challenging multiclass and multilabel classification tasks.
-
-Its performance is evaluated across five event classification tasks, which include both physics processes used during pretraining and new processes not encountered during pretraining. Fine-tuning the pretrained model significantly improves classification performance, particularly in scenarios with limited training data, demonstrating gains in both accuracy and computational efficiency.
-
-To investigate the underlying mechanisms behind these performance improvements, we employ a representational similarity evaluation framework based on *Centered Kernel Alignment*. This analysis reveals notable differences in the learned representations of fine-tuned pretrained models compared to baseline models trained from scratch.
-
-## Introduction
-
-Machine learning has become a ubiquitous tool in particle physics, employed in a variety of tasks including triggering, simulation, reconstruction, and offline analysis. While its utility spans classification, regression, and generative tasks, the current paradigm of developing machine learning models from scratch for each specific application presents several challenges. This approach not only demands specialized expertise and substantial computing resources but can also result in suboptimal performance due to limited training data. The from-scratch development of models necessitates individual validation studies to ensure that neural networks utilize well-modeled information from training samples, whether derived from Monte Carlo simulations or control samples from experimental data.
-
-Foundation models offer a promising direction to address these limitations. These models, pre-trained on large, diverse datasets across various tasks, provide robust and general representations of underlying data structures. Notable examples in other fields include GPT-4 [OpenAI et al., 2024](#ref-openai-2024-gpt4) and BERT [Devlin et al., 2018](#ref-devlin-2018-bert) in natural language processing, Stable Diffusion [Rombach et al., 2021](#ref-rombach-2021-latentdiffusion) in image processing, and AlphaFold [Jumper et al., 2021](#ref-jumper-2021-alphafold) in structural biology. The foundation model approach offers several advantages for particle physics applications: reduced computing resources for fine-tuning [Yosinski et al., 2014](#ref-yosinski-2014-transfer) compared to training from scratch, superior performance on specific tasks (particularly with limited training data), and potentially simplified validation procedures as downstream tasks inherit verified representations from the pre-trained model.
-
-Current literature on pretrained models for particle physics can be categorized based on the data representation they handle. Models operating on particle- or event-level numerical data use features like particle four momenta or jets, leveraging self-supervised or generative methods to learn versatile representations. Detector-focused models operate on high-dimensional responses such as calorimeter deposits or pixel hits, employing geometry-aware techniques for accurate simulation and analysis. Finally, models using textual or code representations apply large language model architectures to integrate domain knowledge, enabling tasks like question answering and code generation.
-
-Recent studies have begun exploring foundation models tailored to particle physics data, which has a variety of distinct structures and properties across many experiments and data processing stages, including:
-
-- particle-level & event-level numeric data [Wildridge et al., 2024](#ref-wildridge-2024-bumblebee), [Katel et al., 2024](#ref-katel-2024-jet), [Golling et al., 2024](#ref-golling-2024-maskedset), [Mikuni & Nachman, 2024](#ref-mikuni-2024-omnilearn), [Harris et al., 2024](#ref-harris-2024-resimulation), [Birk et al., 2024](#ref-birk-2024-omnijet), [Vigl et al., 2024](#ref-vigl-2024-finetune),
-- detector-level & geometry-aware data [Araz et al., 2024](#ref-araz-2024-pointcloud), [Liu et al., 2023](#ref-liu-2023-gaam), [Hashemi et al., 2024](#ref-hashemi-2024-gen), [Huang et al., 2024](#ref-huang-2024-lmtracking),
-- textual or code data [Zhang et al., 2024](#ref-zhang-2024-xiwu).
-
-This paper presents a foundation model designed specifically for collider event-level data. In modern collider experiments, final-stage analysis processes information from reconstructed objects that either directly correspond to particles in collision final states (such as leptons and photons) or serve as proxies (such as jets and missing transverse energy). While traditional approaches often relied on "high-level" variables calculated from object features, recent trends favor direct input of event objects and their features into neural networks for analysis tasks. A notable example is [ATLAS Collaboration, 2023](#ref-atlas-2023-4top), which established the observation of simultaneous production of four top quarks with the ATLAS experiment by employing a graph neural network (GNN) architecture to process event-level object information.
-
-We present foundation models that adopt an architecture similar to that used for [ATLAS Collaboration, 2023](#ref-atlas-2023-4top). Our models are pre-trained using either multiclass classification or multi-label learning tasks across 12 distinct physics processes. We evaluate these models through fine-tuning and testing on five classification tasks, including both familiar and novel processes not seen during pre-training. Our analysis benchmarks the models' performance improvements, their scaling behavior with training sample size, and computational efficiency, representing the first prototype of a foundation model operating on collider final-state object data.
-
-## Data Samples
-
-To provide a diverse set of physics processes for the pretraining, we use MadGraph5_aMC@NLO 2.7.3 [Alwall et al., 2014](#ref-alwall-2014hca) to generate proton-proton collision events at next-to-leading order (NLO) in Quantum Chromodynamics (QCD). We generate 12 distinct Standard Model (SM) physics processes, including six major Higgs boson production mechanisms: gluon fusion production \\(ggF\\), vector boson fusion \\(VBF\\), associated production of the Higgs boson with a W boson \\(WH\\) or a Z boson \\(ZH\\), associated production of the Higgs boson with a top-quark pair \\(t\bar{t}H\\), and associated production of the Higgs boson with a single top quark and a forward quark \\(tHq\\). Additionally, we simulate six top quark production processes: single top production, top-quark pair production \\(t\bar{t}\\), top-quark pair production in association with a pair of photons \\(t\bar{t}\gamma\gamma\\), associated production of a top-quark pair with a W boson \\(t\bar{t}W\\), simultaneous production of three top quarks \\(t\bar{t}t\\), and simultaneous production of four top quarks \\(t\bar{t}t\bar{t}\\). In these samples, the Higgs boson and top quarks decay inclusively. These 12 Higgs and top quark production processes constitute the pretraining dataset.
-
-To test the pretrained model, we further generated four processes, three of which are beyond the Standard Model (BSM): an SM \\(t\bar{t}H\\) production where the Higgs boson decays exclusively to a pair of photons, a \\(t\bar{t}H\\) production with the Higgs boson decaying to a pair of photons, where the top-Yukawa coupling is CP-odd, implemented using the Higgs Characterization model [Artoisenet et al., 2013](#ref-artoisinet-2013puc), the production of a pair of superpartners of the top quark (s-top) using the Minimal Supersymmetric Standard Model (MSSM) [Rosiek, 1990](#ref-rosiek-1990), [Allanach et al., 2009](#ref-allanach-2009), and flavor changing neutral current (FCNC) processes [Degrande et al., 2015](#ref-degrande-2015), [Durieux et al., 2015](#ref-durieux-2015). For the s-top process, we simulate the production of heavier s-top pairs \\(t_2\bar{t_2}\\), where each heavier s-top (mass 582 GeV) decays into a lighter s-top (\\(t_1\\) or \\(\bar{t_1}\\), mass 400 GeV) and a Higgs boson. The FCNC process involves \\(t\bar{t}\\) production where one top quark decays to a Higgs boson and a light quark. We generate 10 million events for each process, except for \\(tHq\\) and \\(t\bar{t}t\bar{t}\\), where 5 million events were produced.
-
-In all simulation samples, the center of mass energy of the proton-proton collision is set to 13 TeV. The Higgs boson, top quarks, and vector bosons are set to decay inclusively (except the \\(t\bar{t}H \rightarrow \gamma\gamma\\) samples), with MadSpin [Artoisenet et al., 2012](#ref-artoisinet-2012st) handling the decays of top quarks and W bosons. The generated events are processed through Pythia 8.235 [Sjostrand et al., 2015](#ref-sjostrand-2015) for parton showering and heavy particle decays, followed by Delphes 3.4.2 [de Favereau et al., 2014](#ref-defavereau-2014) configured to emulate the ATLAS detector [ATLAS Collaboration, 2008](#ref-atlas-2008) for fast detector simulation.
-
-The detector-level object selection criteria are defined to align with typical experimental conditions. Photons are required to have transverse momentum \\(p_T \geq 20~\mathrm{GeV}\\) and pseudorapidity \\(|\eta| \leq 2.37\\), excluding the electromagnetic calorimeter crack region \\(1.37 < |\eta| < 1.52\\). Electrons must have \\(p_T \geq 10~\mathrm{GeV}\\) and \\(|\eta| \leq 2.47\\) (excluding the same crack region), while muons are selected with \\(p_T \geq 10~\mathrm{GeV}\\) and \\(|\eta| \leq 2.7\\). Jets are reconstructed using the anti-\\(k_t\\) algorithm [Cacciari et al., 2008](#ref-cacciari-2008gp) with radius parameter \\(\Delta R=0.4\\), where \\(\Delta R\\) is defined as \\(\sqrt{\Delta\eta ^2 + \Delta\phi^2}\\), with \\(\Delta\eta\\) being the difference in pseudorapidity and \\(\Delta\phi\\) the difference in azimuthal angle. Jets must satisfy \\(p_T \geq 25~\mathrm{GeV}\\) and \\(|\eta| \leq 2.5\\). To avoid double-counting, jets are removed if they are within \\(\Delta R < 0.4\\) of a photon or lepton. The identification of jets originating from b-quark decays (b-tagging) is performed by matching jets within \\(\Delta R = 0.4\\) of a b-quark, with efficiency corrections applied to match the performance of the ATLAS experiment's b-tagging algorithm [ATLAS Collaboration, 2019](#ref-atlas-2019bwq).
-
-## Methods
-
-### Overview
-
-We present a methodology for developing and evaluating a foundation model for particle collision event analysis. The approach centers on pretraining a Graph Neural Network (GNN) architecture using a comprehensive dataset that spans multiple physics tasks, enabling the model to learn robust and transferable features. For task-specific applications, we employ a fine-tuning strategy that combines output layer adaptation with carefully calibrated learning rates for updating the pretrained parameters.
-
-Given the prevalence of classification problems in particle physics data analysis, we evaluate the model's efficacy through a systematic assessment across five binary classification tasks:
-
-- \\(t\bar{t}H(\rightarrow \gamma\gamma)\\) with CP-even versus CP-odd t-H interaction
-- \\(t\bar{t}\\) with FCNC top quark decays versus \\(tHq\\) processes
-- \\(t\bar{t}W\\) versus \\(t\bar{t}t\\) processes
-- Stop pair production with Higgs bosons in the decay chain versus \\(t\bar{t}H\\) processes
-- \\(WH\\) versus \\(ZH\\) production modes
-
-Our evaluation metrics encompass classification performance, computational efficiency, and model interpretability. The investigation extends to analyzing the model's scaling behavior with respect to training dataset size, benchmarked against models trained without pretraining. Although we explored transfer learning through parameter freezing of pretrained layers, this approach did not yield performance improvements, leading us to focus our detailed analysis on fine-tuning strategies.
-
-This methodological framework demonstrates the potential of foundation models to enhance the efficiency of particle physics analyses while improving task-specific performance, offering a promising direction for future high-energy physics research.
-
----
-
-### GNN Architecture
-
-We implement a Graph Neural Network (GNN) architecture that naturally accommodates the point-cloud structure of particle physics data, employing the DGL framework with a PyTorch backend [Wang et al., 2019][ref-dgl-2019], [Paszke et al., 2019][ref-pytorch-2019]. A fully connected graph is constructed for each event, with nodes corresponding to reconstructed jets, electrons, muons, photons, and \\(\vec{E}_T^{\text{miss}}\\). The features of each node include the four-momentum \\((p_T, \eta, \phi, E)\\) of the object with a massless assumption (\\(E = p_T \cosh \eta\\)), the b-tagging label (for jets), the charge (for leptons), and an integer labeling the type of object represented by the node. We use a placeholder value of 0 for features which are not defined for every node type such as the b-jet tag, lepton charge, or the pseudorapidity of \\(\vec{E}_T^{\text{miss}}\\). We assign the angular distances (\\(\Delta \eta, \Delta \phi, \Delta R\\)) as edge features and the number of nodes \\(N\\) in the graph as a global feature. We denote the node features \\(\{\vec x_i\}\\), edge features \\(\{\vec y_{ij}\}\\), and global features \\(\{\vec z\}\\).
-
-The GNN model is based on the graph network architecture described in [Battaglia et al., 2018][ref-graphnets-2018] using simple multilayer perceptron (MLP) feature functions and summation aggregation. The model comprises three primary components: an encoder, the graph network, and a decoder. In the encoder, three MLPs embed the nodes, edges, and global features into a latent space of dimension 64. The graph network block, which is designed to facilitate message passing between different domains of the graph, performs an edge update \\(f_e\\), followed by a node update \\(f_n\\), and finally a global update \\(f_g\\), all defined below. The inputs to each update MLP are concatenated.
-
-$$
-\vec {y'}_{ij} = f_e\left(\{\vec x_k\},\vec y_{ij},\vec z\right) = \mathrm{MLP}\left(\vec x_i,\vec x_j,\vec y_{ij},\vec z\right)
-$$
-
-$$
-\vec{x'}_{i} = f_n\left(\vec x_i,\{\vec{y'}_{jk}\},\vec z\right) = \mathrm{MLP}\left(\vec x_i,\sum_j\vec{y'}_{ij},\vec z\right)
-$$
-
-$$
-\vec{z'} = f_g\left(\{\vec{x'}_i\},\{\vec{y'}_{ij}\},\vec z\right) = \mathrm{MLP}\left(\sum_i\vec{x'}_i,\sum_{i,j}\vec{y'}_{ij},\vec z\right)
-$$
-
-This graph block is iterated four times with the same update MLPs. Finally, the global features are passed through a decoder MLP and a final linear layer to produce the desired model outputs. Each MLP consists of 4 linear layers, each with an output width of 64, with the `ReLU` activation function. The output of the MLP is then passed through a `LayerNorm` layer [Ba et al., 2016][ref-layernorm-2016]. The total number of trainable parameters in this model is about 400,000.
-
-As a performance benchmark, a baseline GNN model is trained from scratch for each classification task. The initial learning rate is set to \\(10^{-4}\\) with an exponential decay following \\(LR(x) = LR_{\text{initial}}\cdot(0.99)^x\\), where \\(x\\) represents the epoch number.
-
----
-
-### Pretraining Strategy
-
-We explore two complementary pretraining approaches to develop robust representations of collision events: (1) multi-class classification, which trains the model to distinguish between different physics processes, and (2) multi-label classification, which predicts the existence and kinematics of heavy particles with prompt decays. The pretraining dataset consists of approximately 120 million events, evenly distributed across 12 distinct physics processes, including all major Higgs boson production mechanisms and top quark processes as described in [Data Samples](#sec-data). This large-scale pretraining effort was conducted on the Perlmutter supercomputer at NERSC.
-
-#### Multi-class Classification
-
-For Monte Carlo simulated events, the underlying physics process that generated each event is known precisely, providing natural labels for supervised learning. However, the challenge lies in the complexity of collision events: different physics processes can produce similar kinematics and event topologies, particularly in certain regions of phase space. No single observable can unambiguously identify the underlying process. By training the model to distinguish between 12 different processes simultaneously, we challenge it to learn subtle differences in kinematics and topology that collectively characterize each process. The model is trained using categorical cross entropy as the loss function. The output layer of the multiclass classification model has 832 trainable parameters.
-
-#### Multi-label Classification
-
-This approach combines both classification and regression tasks to characterize collision events. For discrete properties like particle presence in specific kinematic regions, we employ classification labels with binary cross-entropy loss. For continuous quantities like particle multiplicities, we use regression labels with mean-squared error loss. This hybrid approach enables the model to learn both categorical and continuous aspects of the physics processes simultaneously.
-
-We develop a comprehensive set of 41 labels that capture both particle multiplicities and kinematic properties. This approach increases prediction granularity and enhances model interpretability. By training the model to predict event kinematics rather than event identification, we create a task-independent framework that can potentially generalize better to novel scenarios not seen during pretraining.
-
-The particle multiplicity labels count the number of Higgs bosons (\\(n_{\text{higgs}}\\)), top quarks (\\(n_{\text{tops}}\\)), vector bosons (\\(n_V\\)), \\(W\\) bosons (\\(n_W\\)), and \\(Z\\) bosons (\\(n_Z\\)). The kinematic labels characterize the transverse momentum (\\(p_T\\)), pseudorapidity (\\(\eta\\)), and azimuthal angle (\\(\phi\\)) of Higgs bosons and top quarks through binned classifications.
-
-For Higgs bosons, \\(p_T\\) is categorized into three ranges: (0, 30) GeV, (30, 200) GeV, and (200, \\(\infty\\)) GeV, with the upper range particularly sensitive to potential BSM effects. Similarly, both leading and subleading top quarks have \\(p_T\\) classifications spanning (0, 30) GeV, (30, 300) GeV, and (300, \\(\infty\\)) GeV. When no particle exists within a specific \\(p_T\\) range, the corresponding label is set to \\([0, 0, 0]\\). For all particles, \\(\eta\\) measurements are divided into 4 bins with boundaries at \\([-1.5, 0, 1.5]\\), while \\(\phi\\) measurements use 4 bins with boundaries at \\([-\frac{\pi}{2}, 0, \frac{\pi}{2}]\\). As with \\(p_T\\), both \\(\eta\\) and \\(\phi\\) labels default to \\([0, 0, 0, 0]\\) in the absence of a particle. This comprehensive labeling schema enables fine-grained learning of kinematic distributions and particle multiplicities, essential for characterizing complex collision events.
-
-The loss function combines individual losses from all 41 labels through weighted averaging. Binary cross-entropy is applied to classification labels, while mean-squared error is used for regression labels. The model generates predictions for all labels simultaneously, with individual losses calculated according to their respective types. The final loss is computed as an equally-weighted average across all labels, with weights set to 1 to ensure uniform contribution to the optimization process. The output layer of the multilabel model has 2,688 trainable parameters.
-
-#### Pretraining
-
-During pre-training, the initial learning rate is \\(10^{-4}\\), and the learning rate decays by 1% each epoch following the exponential function \\(LR(x) = 10^{-4}\cdot(0.99)^x\\), where \\(x\\) is the epoch number. Both pre-trained models reach a plateau in loss by epoch 50, at which point the training is stopped.
-
----
-### Fine-tuning Methodology
-
-For downstream tasks, we adjust the model architecture for fine-tuning by replacing the original output layer (final linear layer) with a newly initialized linear layer while retaining the pre-trained weights for all other layers. This modification allows the model to specialize in the specific downstream task while leveraging the general features learned during pretraining.
-
-The fine-tuning process begins with distinct learning rate setups for different parts of the model. The newly initialized linear layer is trained with an initial learning rate of \\(10^{-4}\\), matching the rate used for models trained from scratch. Meanwhile, the pre-trained layers are fine-tuned more cautiously with a lower initial learning rate of \\(10^{-5}\\). This approach ensures that the pre-trained layers adapt gradually without losing their general features, while the new layer learns effectively from scratch. Both learning rates decay over time following the same exponential function, \\(LR(x) = LR_{\text{initial}} \cdot (0.99)^x\\), to promote stable convergence as training progresses.
-
-We also evaluated a transfer learning setup in which either the decoder MLP or the final linear layer was replaced with a newly initialized component. During this process, all other model parameters remained frozen, leveraging the pre-trained features without further updating them. However, we did not observe performance improvements using the transfer learning setup. Consequently, we focus on reporting results obtained with the fine-tuning approach.
-
----
-
-### Performance Evaluation
-
-We assess model performance using two figures of merit: the classification accuracy and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. The accuracy is defined as the fraction of correctly classified events when applying a threshold of 0.5 to the neural network output score. Both metrics demonstrate consistent trends in our analysis.
-
-To obtain reliable performance estimates and uncertainties, we employ an ensemble training approach where 5 independent models are trained for each configuration with random weight initialization and random subsets of the training dataset. This enables us both to evaluate the models' sensitivity to initial parameters and to quantify uncertainties in their performance.
-
-To investigate how model performance scales with training data, we conducted training runs using sample sizes ranging from \\(10^3\\) to \\(10^7\\) events per class (\\(10^3\\), \\(10^4\\), \\(10^5\\), \\(10^6\\), and \\(10^7\\)) for each model setup: the from-scratch baseline and models fine-tuned from multi-class or multi-label pretrained models. For the \\(10^7\\) case, only the initialization was randomized due to dataset size limitations. All models were evaluated on the same testing dataset, consisting of 2 million events per class, which remained separate from the training process.
-
-| **Name of Task**     | **Pretraining Task** | \\(10^3\\)         | \\(10^4\\)         | \\(10^5\\)         | \\(10^6\\)         | \\(10^7\\)         |
-|----------------------|----------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
-| **ttH CP Even vs Odd** | Baseline Accuracy     | 56.5 ± 1.1 | 62.2 ± 0.1 | 64.3 ± 0.0 | 65.7 ± 0.0 | 66.2 ± 0.0 |
-|                        | Multiclass (%)        | +4.8 ± 1.1 | +3.4 ± 0.1 | +1.3 ± 0.0 | +0.2 ± 0.0 | −0.0 ± 0.0 |
-|                        | Multilabel (%)        | +2.1 ± 1.2 | +1.9 ± 0.1 | +0.8 ± 0.1 | +0.0 ± 0.0 | −0.1 ± 0.0 |
-| **FCNC vs tHq**        | Baseline Accuracy     | 63.6 ± 0.7 | 67.8 ± 0.4 | 68.4 ± 0.3 | 69.3 ± 0.3 | 67.9 ± 0.0 |
-|                        | Multiclass (%)        | +5.8 ± 0.8 | +1.2 ± 0.4 | +1.4 ± 0.3 | +0.5 ± 0.3 | −0.0 ± 0.0 |
-|                        | Multilabel (%)        | −5.3 ± 0.8 | −1.3 ± 0.4 | +0.9 ± 0.4 | +0.3 ± 0.3 | +0.4 ± 0.1 |
-| **ttW vs ttt**         | Baseline Accuracy     | 75.8 ± 0.1 | 77.6 ± 0.1 | 78.9 ± 0.0 | 79.8 ± 0.0 | 80.3 ± 0.0 |
-|                        | Multiclass (%)        | +3.7 ± 0.1 | +2.7 ± 0.1 | +1.3 ± 0.0 | +0.4 ± 0.0 | +0.0 ± 0.0 |
-|                        | Multilabel (%)        | +2.2 ± 0.1 | +1.1 ± 0.1 | +0.5 ± 0.0 | +0.0 ± 0.0 | −0.1 ± 0.0 |
-| **stop vs ttH**        | Baseline Accuracy     | 83.0 ± 0.2 | 86.3 ± 0.1 | 87.6 ± 0.0 | 88.5 ± 0.0 | 88.8 ± 0.0 |
-|                        | Multiclass (%)        | +0.4 ± 0.2 | +1.9 ± 0.1 | +1.0 ± 0.0 | +0.3 ± 0.0 | +0.0 ± 0.0 |
-|                        | Multilabel (%)        | +2.8 ± 0.2 | +1.0 ± 0.1 | +0.5 ± 0.0 | +0.0 ± 0.0 | −0.0 ± 0.0 |
-| **WH vs ZH**           | Baseline Accuracy     | 51.4 ± 0.1 | 53.9 ± 0.1 | 55.8 ± 0.0 | 57.5 ± 0.0 | 58.0 ± 0.0 |
-|                        | Multiclass (%)        | +5.2 ± 0.1 | +5.3 ± 0.1 | +3.1 ± 0.0 | +0.6 ± 0.0 | +0.1 ± 0.0 |
-|                        | Multilabel (%)        | −1.1 ± 0.1 | −0.9 ± 0.2 | +0.5 ± 0.1 | +0.1 ± 0.0 | −0.1 ± 0.0 |
-
-> **Table 1**: Accuracy of the baseline model versus the accuracy increase due to fine-tuning from various pretraining tasks.  
-> The accuracies are averaged over 5 independently trained models with randomly initialized weights and trained on a random subset of the data. One exception is the \\(10^7\\) training where all models use the same dataset due to limitations on our dataset size. The random subsets are allowed to overlap, but this overlap should be very minimal because all models take an independent random subset of \\(10^7\\) events. The testing accuracy is calculated from the same testing set of 2 million events per class across all models for a specific training task. The errors are the propagated errors (root sum of squares) of the standard deviation of accuracies for each model.
-
-## Results
-
-### Classification Performance
-
-Since AUC and accuracy show similar trends, we report only accuracy here for conciseness; the results are shown in Table 1.
-
-In general, the fine-tuned pretrained model achieves at least the same level of classification performance as the baseline model. Notably, there are significant improvements, particularly when the sample size is small, ranging from \\(10^3\\) to \\(10^4\\) events. In some cases, the accuracy improvements exceed five percentage points, demonstrating that pretrained models provide a strong initial representation that compensates for limited data. The numerical values of the improvements in accuracy may not fully capture the impact on the sensitivity of the measurements for which the neural network classifier is used, and the final sensitivity improvement is likely to be greater.
-
-As the training sample size grows to \\(10^5\\), \\(10^6\\), and eventually \\(10^7\\) events, the added benefit of pretraining diminishes. With abundant data, models trained from scratch approach or even match the accuracy of fine-tuned pretrained models. This suggests that large datasets enable effective learning from scratch, rendering the advantage of pretraining negligible in such scenarios.
-
-Although both pretraining approaches offer benefits, multiclass pretraining tends to provide more consistent improvements across tasks, especially in the low-data regime. In contrast, multilabel pretraining can sometimes lead to neutral or even slightly negative effects for certain tasks and data sizes. This highlights the importance of the pretraining task design, as the similarity between pretraining and fine-tuning tasks in the multiclass approach appears to yield better-aligned representations.
-
-Finally, the spread of accuracy across the five tasks for the baseline model is quite large, offering a robust test of fine-tuning across tasks of varying difficulty. The consistent observation of these trends across tasks confirms the reliability and robustness of the findings.
-
----
-
-### Model Interpretability
-
-We aim to understand whether pretrained and baseline models learn the same underlying representations. If the two models exhibit high similarity, a plausible interpretation is that pretraining provides the pretrained model with an advantageous initialization, allowing it to converge to a similar state as the baseline model more efficiently. Conversely, significant differences between the models would indicate that pretraining facilitates the development of a more general and robust latent space, which serves as a foundation for fine-tuning to effectively adapt to the downstream task. To investigate this, we analyzed the representational similarity between a pretrained model fine-tuned for the downstream task and a baseline model trained directly on the downstream task without pretraining.
-
-We use Centered Kernel Alignment (CKA) [Kornblith et al., 2019][ref-kornblith-2019-cka] to analyze model similarity and interpretability. CKA is a robust metric that quantifies the similarity between the internal representations of neural networks by comparing their feature matrices in a manner that is invariant to scaling, rotation, and alignment. This invariance makes CKA particularly effective for studying relationships between network layers, even across networks of different sizes or those trained from varying initializations.
-
-The similarity is evaluated using a 64-dimensional latent representation after the decoder stage of the GNN model. This choice allows us to compare the internal states of the models at a fine-grained level and understand how training strategies impact the representations directly used for the output task.
-
-To provide an intuitive understanding of CKA values, we construct a table of the CKA scores for various transformations performed on a set of dummy data.
-
-- **A:** randomly initialized matrix with shape (1000, 64), following a normal distribution (\\(\sigma = 1, \mu = 0\\))
-- **B:** matrix with shape (1000, 64) constructed via various transformations performed on \\(A\\)
-- **Noise:** randomly initialized noise matrix with shape (1000, 64), following a normal distribution (\\(\sigma = 1, \mu = 0\\))
-
-| Dataset | CKA Score |
-|---------|-----------|
-| \\(A, B = A\\) | 1.00 |
-| \\(A, B =\\) permutation on columns of \\(A\\) | 1.00 |
-| \\(A, B = A + \mathrm{Noise}(0.1)\\) | 0.99 |
-| \\(A, B = A + \mathrm{Noise}(0.5)\\) | 0.80 |
-| \\(A, B = A + \mathrm{Noise}(0.75)\\) | 0.77 |
-| \\(A, B = A \cdot \mathrm{Noise}(1)\\) (Linear Transformation) | 0.76 |
-| \\(A, B = A + \mathrm{Noise}(1)\\) | 0.69 |
-| \\(A, B = A + \mathrm{Noise}(2)\\) | 0.51 |
-| \\(A, B = A + \mathrm{Noise}(5)\\) | 0.39 |
-
-**Table 2:** CKA scores for a dummy dataset \\(A\\) and \\(B\\), where \\(B\\) is created via various transformations performed on \\(A\\).
-
-As seen in Table 2 and in the definition of the CKA, the CKA score is permutation-invariant. We will use the CKA score to evaluate the similarity between various models and gain insight into the learned representation of detector events in each model (i.e., the information that each model learns).
-
-We train ensembles of models for each training task to observe how the CKA score changes due to the random initialization of our models. The CKA score between two training setups \\(A\\) and \\(B\\) is then defined as the average over all pairs of ensemble members:
-
-\\[
-CKA(A, B) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} CKA(A_i, B_j)
-\\]
-
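-where \\(A_i\\) is the representation learned by the \\(i^{\text{th}}\\) model in the ensemble for setup \\(A\\), \\(B_j\\) is the representation learned by the \\(j^{\text{th}}\\) model in the ensemble for setup \\(B\\), and each ensemble contains \\(n\\) models. The quoted uncertainty on the CKA score is the standard deviation of the pairwise values \\(CKA(A_i, B_j)\\).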
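-
-Written out, the standard deviation of the pairwise scores takes the form below. This is a sketch that assumes the population normalization \\(1/n^2\\); the text does not say whether a sample normalization is used instead, and the symbol \\(\sigma_{CKA}\\) is introduced here for illustration.
-
-\\[
-\sigma_{CKA} = \sqrt{\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( CKA(A_i, B_j) - CKA(A, B) \right)^2}
-\\]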
-
-Table 3 presents the CKA similarity between the final model in each setup and the final baseline model.
-
-| Training Task         | Baseline         | Multiclass      | Multilabel      |
-|-----------------------|------------------|-----------------|-----------------|
-| ttH CP Even vs Odd    | 0.94 ± 0.05      | 0.82 ± 0.01     | 0.77 ± 0.06     |
-| FCNC vs tHq           | 0.96 ± 0.03      | 0.76 ± 0.01     | 0.81 ± 0.01     |
-| ttW vs ttt            | 0.91 ± 0.08      | 0.75 ± 0.10     | 0.72 ± 0.05     |
-| stop vs ttH           | 0.87 ± 0.11      | 0.79 ± 0.12     | 0.71 ± 0.08     |
-| WH vs ZH              | 0.90 ± 0.07      | 0.53 ± 0.03     | 0.44 ± 0.06     |
-
-**Table 3:** CKA similarity between the latent representation before the decoder in each setup and that of the baseline model, averaged over 3 models per training setup. All models are trained with the full dataset (\\(10^7\\) events). The baseline column is not guaranteed to be 1.0 because each baseline model starts from a different random initialization and converges to a slightly different representation, as seen in the CKA values in that column.
-
-The baseline models with different initializations exhibit high similarity values, ranging from approximately 0.87 to 0.96, which indicates that independently trained baseline models tend to converge on similar internal representations despite random initialization. Across the considered tasks, models pretrained as multi-class or multi-label classifiers and then fine-tuned exhibit noticeably lower CKA similarity scores when compared to the baseline model. For example, in the WH vs ZH task, two independently trained baseline models have a high similarity of 0.90, whereas the multi-class and multi-label models show significantly reduced similarities to the baseline (0.53 and 0.44, respectively). This pattern suggests that the representational spaces developed by the pretrained models differ substantially from those learned by the baseline model that was trained directly on the downstream classification task.
-
-### Computational Efficiency
-
-To estimate the computational resources required for each approach, we measured the wall time needed for a model to reach its final performance. For baseline models, this is defined as the wall time from the start of training until the loss of the model plateaus. For the foundation model approach, the estimate includes both the pretraining time and the fine-tuning time, each measured from the start of training until the loss plateaus. This approach ensures a consistent and comprehensive evaluation of the computational demands.
-
-![The ratio of the fine-tuning time required to achieve 99% of the baseline model's final classification accuracy to the total time spent training the baseline model.](training_time.png)
-*Fig. 1: The ratio of the fine-tuning time required to achieve 99% of the baseline model's final classification accuracy to the total time spent training the baseline model.*
-
-Figure 1 shows the fine-tuning time for the model pretrained with multiclass classification, relative to the time required for the baseline model, as a function of training sample size. In general, the fine-tuning time is significantly shorter than the training time required by the baseline model approach. For smaller training sets, on the order of \\(10^5\\) events, tasks such as FCNC vs. tHq and ttW vs. ttt benefit substantially from the pretrained model’s “head start,” achieving their final performance in only about 1% of the baseline time. For large training datasets, the fine-tuning time relative to the baseline training time becomes larger; however, given that the large training sample typically requires longer training time, fine-tuning still yields much faster training convergence. The ttH CP-even vs. ttH CP-odd task, with a training sample size of \\(10^7\\) events, is an exception where the fine-tuning time exceeds the training time required for the baseline model. This is likely because the processes involved in this task include photon objects in the final states, which are absent from the events used during pretraining.
-
-To accurately evaluate the total time consumption, it is necessary to include the pretraining time required for the foundation model approach. The pretraining times are as follows:
-
-- **Multi-class pretraining:** 45.5 GPU hours
-- **Multi-label pretraining:** 60.0 GPU hours
-
-The GPU hours recorded for the multi-label model represent the total time required when training the model in parallel on 16 GPUs. This includes a model synchronization step, which results in higher GPU hours compared to the multi-class pretraining model.
-
-The foundation model approach becomes increasingly efficient when a large number of tasks are fine-tuned using the same pretrained model, compared to training each task independently from scratch. To illustrate this, we evaluate the computational time required for a scenario where the training sample contains \\(10^7\\) events. For the five tasks tested in this study, the baseline training time (training from scratch) ranges from 1.68 GPU hours (WH vs. ZH) to 5.30 GPU hours (ttW vs. ttt), with an average baseline training time of 2.94 GPU hours. In contrast, fine-tuning the pretrained model takes on average 38% of the baseline training time for \\(10^7\\) events. Based on these averages, we estimate that the foundation model approach becomes more computationally efficient than the baseline approach when fine-tuning is performed for more than 41 tasks.
-
-As a practical example, the ATLAS measurement of Higgs boson couplings using the \\(H \rightarrow \gamma\gamma\\) decay channel [ATLAS Collaboration, 2023][ref-atlas-2023-higg] involved training 42 classifiers for event categorization. This number is just above our break-even estimate, suggesting that the foundation model approach can reduce computational costs even for a single high-energy physics measurement.
-
-## Conclusions
-
-We presented an in-depth study of a particle physics foundation model designed to operate on the four-momentum and identification properties of event final-state objects. This model is built on a Graph Neural Network (GNN) architecture and trained on a dataset comprising 120 million simulated proton-proton collision events across 12 distinct physics processes. The pretraining phase explored both multiclass and multilabel classification tasks, providing a robust foundation for downstream applications. Notably, the pretrained models demonstrated significant improvements in event classification performance when fine-tuned, particularly for tasks with limited training samples.
-
-The foundation model approach also offers substantial computational advantages. By leveraging fine-tuning, this methodology reduces the computational resources required for large-scale applications across multiple tasks. Our estimates indicate that significant resource savings can be achieved even for single particle physics measurements, making this approach both scalable and efficient.
-
-To better understand the learned representations of the pretrained model and guide future optimization efforts, we employed a representational similarity evaluation framework using Centered Kernel Alignment (CKA). This metric allowed us to investigate the source of the performance gains observed in the foundation model. Our analysis revealed notable differences in the learned representations between the fine-tuned pretrained model and a baseline model trained from scratch. In deep learning, it is well-established that multiple equally valid solutions can exist. Future studies are necessary to determine whether the low similarity in latent representations reflects complementary information uniquely captured by the foundation and baseline models, or if it can simply be attributed to connected local minima in the loss landscape.
-
-## Acknowledgments
-
-This work is supported by the U.S. National Science Foundation under Award No. 2046280, and by the U.S. Department of Energy, Office of Science, under contract DE-AC02-05CH11231.
-
-## References
-
-- <span id="ref-openai-2024-gpt4"></span> **OpenAI et al.** GPT-4 Technical Report. arXiv:2303.08774 (2024). [https://arxiv.org/abs/2303.08774](https://arxiv.org/abs/2303.08774)
-
-- <span id="ref-yosinski-2014-transfer"></span> **Jason Yosinski, Jeff Clune, Yoshua Bengio, Hod Lipson.** How transferable are features in deep neural networks? CoRR abs/1411.1792 (2014). [http://arxiv.org/abs/1411.1792](http://arxiv.org/abs/1411.1792)
-
-- <span id="ref-rombach-2021-latentdiffusion"></span> **Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.** High-Resolution Image Synthesis with Latent Diffusion Models. CoRR abs/2112.10752 (2021). [https://arxiv.org/abs/2112.10752](https://arxiv.org/abs/2112.10752)
-
-- <span id="ref-podell-2023-sdxl"></span> **Dustin Podell, Zion English, Kyle Lacey et al.** SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952 (2023). [https://arxiv.org/abs/2307.01952](https://arxiv.org/abs/2307.01952)
-
-- <span id="ref-jumper-2021-alphafold"></span> **John Jumper, Richard Evans, Alexander Pritzel et al.** Highly accurate protein structure prediction with AlphaFold. Nature 596, 583-589 (2021). [https://doi.org/10.1038/s41586-021-03819-2](https://doi.org/10.1038/s41586-021-03819-2)
-
-- <span id="ref-devlin-2018-bert"></span> **Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova.** BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). [http://arxiv.org/abs/1810.04805](http://arxiv.org/abs/1810.04805)
-
-- <span id="ref-atlas-2023-higg"></span> **ATLAS Collaboration.** Measurement of the properties of Higgs boson production at \\(\sqrt{s} = 13\,\text{TeV}\\) in the \\(H \to \gamma\gamma\\) channel using \\(139\,\text{fb}^{-1}\\) of \\(pp\\) collision data with the ATLAS experiment. JHEP 07 (2023) 088. [arXiv:2207.00348](https://arxiv.org/abs/2207.00348), [https://doi.org/10.1007/JHEP07(2023)088](https://doi.org/10.1007/JHEP07(2023)088)
-
-- <span id="ref-atlas-2023-4top"></span> **ATLAS Collaboration.** Observation of four-top-quark production in the multilepton final state with the ATLAS detector. Eur. Phys. J. C 83 (2023) 496. [arXiv:2303.15061](https://arxiv.org/abs/2303.15061), [https://doi.org/10.1140/epjc/s10052-023-11573-0](https://doi.org/10.1140/epjc/s10052-023-11573-0)
-
-- <span id="ref-kornblith-2019-cka"></span> **Simon Kornblith, Mohammad Norouzi, Honglak Lee, Geoffrey Hinton.** Similarity of Neural Network Representations Revisited. CoRR abs/1905.00414 (2019). [http://arxiv.org/abs/1905.00414](http://arxiv.org/abs/1905.00414)
-
----
-
-<!-- Historical/General Physics foundational texts -->
-
-- <span id="ref-birell-1982-qfields"></span> **N. D. Birell, P. C. W. Davies.** Quantum Fields in Curved Space. Cambridge Univ. Press (1982).
-
-- <span id="ref-feynman-1954"></span> **R. P. Feynman.** Phys. Rev. 94, 262 (1954).
-
-- <span id="ref-einstein-1935-epr"></span> **A. Einstein, Yu. Podolsky, N. Rosen.** Phys. Rev. 47, 777 (1935).
-
-- <span id="ref-berman-1983-stability"></span> **G. P. Berman, Jr., F. M. Izrailev, Jr.** Stability of nonlinear modes. Physica D 88, 445 (1983).
-
-- <span id="ref-davies-1988-trapped"></span> **E. B. Davies, L. Parns.** Trapped modes in acoustic waveguides. Q. J. Mech. Appl. Math. 51, 477–492 (1988).
-
-- <span id="ref-witten-2001"></span> **Edward Witten.** hep-th/0106109 (2001). [https://arxiv.org/abs/hep-th/0106109](https://arxiv.org/abs/hep-th/0106109)
-
----
-
-<!-- Particle physics/data science foundational models -->
-
-- <span id="ref-beutler-1994-hem"></span> **E. Beutler.** Williams Hematology, 5th Edition, Chapter 7, pp. 654–662. McGraw-Hill, New York (1994).
-
-- <span id="ref-knuth-1973-fa"></span> **Donald E. Knuth.** The Art of Computer Programming vol. 1: Fundamental Algorithms, 2nd Ed., Addison-Wesley (1973).
-
-- <span id="ref-smith-2005-philos"></span> **J. S. Smith, G. W. Johnson.** Philos. Trans. R. Soc. London, Ser. B 777, 1395 (2005).
-
-- <span id="ref-smith-2010-jap-unpub"></span> **W. J. Smith, T. J. Johnson, B. G. Miller.** Surface chemistry and preferential crystal orientation on a silicon surface. J. Appl. Phys. (unpublished, 2010).
-
-- <span id="ref-smith-2010-jap-sub"></span> **V. K. Smith, K. Johnson, M. O. Klein.** Surface chemistry and preferential crystal orientation on a silicon surface. J. Appl. Phys. (submitted, 2010).
-
-- <span id="ref-underwood-1988-lowerbounds"></span> **Ulrich Underwood, Ned Net, Paul Pot.** Lower Bounds for Wishful Research Results. Talk at Fanstord University (1988).
-
-- <span id="ref-johnson-2007-comm"></span> **M. P. Johnson, K. L. Miller, K. Smith.** Personal communication (Jan-May 2007).
-
----
-
-<!-- Prototypical collider software and tools -->
-
-- <span id="ref-pytorch-2019"></span> **Adam Paszke et al.** PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703 (2019). [http://arxiv.org/abs/1912.01703](http://arxiv.org/abs/1912.01703)
-
-- <span id="ref-dgl-2019"></span> **Minjie Wang et al.** Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs. arXiv:1909.01315 (2019). [http://arxiv.org/abs/1909.01315](http://arxiv.org/abs/1909.01315)
-
-- <span id="ref-graphnets-2018"></span> **Peter W. Battaglia et al.** Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261 (2018). [http://arxiv.org/abs/1806.01261](http://arxiv.org/abs/1806.01261)
-
-- <span id="ref-layernorm-2016"></span> **Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton.** Layer Normalization. arXiv:1607.06450 (2016). [https://arxiv.org/abs/1607.06450](https://arxiv.org/abs/1607.06450)
-
----
-
-<!-- Recent & foundation models in HEP ML -->
-
-- <span id="ref-wildridge-2024-bumblebee"></span> **Andrew J. Wildridge et al.** Bumblebee: Foundation Model for Particle Physics Discovery. arXiv:2412.07867 (2024). [https://arxiv.org/abs/2412.07867](https://arxiv.org/abs/2412.07867)
-
-- <span id="ref-katel-2024-jet"></span> **Subash Katel et al.** Learning Symmetry-Independent Jet Representations via Jet-Based Joint Embedding Predictive Architecture. arXiv:2412.05333 (2024). [https://arxiv.org/abs/2412.05333](https://arxiv.org/abs/2412.05333)
-
-- <span id="ref-araz-2024-pointcloud"></span> **Jack Y. Araz et al.** Point cloud-based diffusion models for the Electron-Ion Collider. arXiv:2410.22421 (2024). [https://arxiv.org/abs/2410.22421](https://arxiv.org/abs/2410.22421)
-
-- <span id="ref-leigh-2024-maskedparticle"></span> **Matthew Leigh et al.** Is Tokenization Needed for Masked Particle Modelling? arXiv:2409.12589 (2024). [https://arxiv.org/abs/2409.12589](https://arxiv.org/abs/2409.12589)
-
-- <span id="ref-mikuni-2024-omnilearn"></span> **Vinicius Mikuni, Benjamin Nachman.** OmniLearn: A Method to Simultaneously Facilitate All Jet Physics Tasks. arXiv:2404.16091 (2024). [https://arxiv.org/abs/2404.16091](https://arxiv.org/abs/2404.16091)
-
-- <span id="ref-zhang-2024-xiwu"></span> **Zhengde Zhang et al.** Xiwu: A Basis Flexible and Learnable LLM for High Energy Physics. arXiv:2404.08001 (2024). [https://arxiv.org/abs/2404.08001](https://arxiv.org/abs/2404.08001)
-
-- <span id="ref-harris-2024-resimulation"></span> **Philip Harris et al.** Re-Simulation-based Self-Supervised Learning for Pre-Training Foundation Models. arXiv:2403.07066 (2024). [https://arxiv.org/abs/2403.07066](https://arxiv.org/abs/2403.07066)
-
-- <span id="ref-birk-2024-omnijet"></span> **Joschka Birk, Anna Hallin, Gregor Kasieczka.** OmniJet-\\(\alpha\\): the first cross-task foundation model for particle physics. Machine Learning: Science and Technology 5(3), 035031 (Aug 2024). [https://doi.org/10.1088/2632-2153/ad66ad](https://doi.org/10.1088/2632-2153/ad66ad)
-
-- <span id="ref-huang-2024-lmtracking"></span> **Andris Huang et al.** A Language Model for Particle Tracking. arXiv:2402.10239 (2024). [https://arxiv.org/abs/2402.10239](https://arxiv.org/abs/2402.10239)
-
-- <span id="ref-golling-2024-maskedset"></span> **Tobias Golling et al.** Masked Particle Modeling on Sets: Towards Self-Supervised High Energy Physics Foundation Models. arXiv:2401.13537 (2024). [https://arxiv.org/abs/2401.13537](https://arxiv.org/abs/2401.13537)
-
-- <span id="ref-liu-2023-gaam"></span> **Junze Liu et al.** Generalizing to new geometries with Geometry-Aware Autoregressive Models (GAAMs) for fast calorimeter simulation. Journal of Instrumentation 18(11), P11003 (Nov 2023). [https://doi.org/10.1088/1748-0221/18/11/p11003](https://doi.org/10.1088/1748-0221/18/11/p11003)
-
-- <span id="ref-hashemi-2024-gen"></span> **Baran Hashemi et al.** Ultra-high-granularity detector simulation with intra-event aware generative adversarial network and self-supervised relational reasoning. Nature Communications 15(1) (June 2024). [https://doi.org/10.1038/s41467-024-49104-4](https://doi.org/10.1038/s41467-024-49104-4)
-
-- <span id="ref-vigl-2024-finetune"></span> **Matthias Vigl et al.** Finetuning Foundation Models for Joint Analysis Optimization. arXiv:2401.13536 (2024). [https://arxiv.org/abs/2401.13536](https://arxiv.org/abs/2401.13536)
-
-- <span id="ref-li-2024-refine"></span> **Chen Li, Hao Cai, Xianyang Jiang.** Refine neutrino events reconstruction with BEiT-3. Journal of Instrumentation 19(6), T06003 (Jun 2024). [https://doi.org/10.1088/1748-0221/19/06/t06003](https://doi.org/10.1088/1748-0221/19/06/t06003)
\ No newline at end of file
diff --git a/physicsnemo/configs/config.yaml b/physicsnemo/configs/config.yaml
deleted file mode 100644
index f4ba528f692c8cde29bfb3cad5560510b408f134..0000000000000000000000000000000000000000
--- a/physicsnemo/configs/config.yaml
+++ /dev/null
@@ -1,64 +0,0 @@
-# ignore_header_test
-# Copyright 2023 Stanford University
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-random_seed: 2
-
-scheduler:
-  lr: 1.E-3
-  lr_decay: 1.E-3
-
-training:
-  epochs: 100
-  
-checkpoints:
-  ckpt_path: "checkpoints"
-  ckpt_name: "config"
-
-performance:
-  amp: False
-  jit: False
-
-architecture:
-  processor_size: 8
-  hidden_dim_node_encoder: 128
-  hidden_dim_edge_encoder: 128
-  hidden_dim_processor: 128
-  hidden_dim_node_decoder: 128
-  out_dim: 1
-
-paths:
-  data_dir: /global/cfs/projectdirs/atlas/joshua/hackathon_data/stats_100K
-  save_dir: /pscratch/sd/j/joshuaho/physicsnemo/graphs/stats_100K
-  training_dir: ./training_stats_100K/
-
-datasets:
-  - name: ttH_cp_even
-    load_path: ${paths.data_dir}/ttH_NLO.root
-    label: 0
-  - name: ttH_cp_odd
-    load_path: ${paths.data_dir}/ttH_CPodd.root
-    label: 1
-
-root_dataset:
-  ttree: output
-  type: torch.bfloat16
-  particles: ["jet", "ele", "mu", "ph", "MET"]
-  features: ["pt", "eta", "phi", "energy", "btag", "charge", "node_type"]
-  globals: []
-  weights: ""
-  tracking: []
-  step_size: 8192
-  batch_size: 8192
-  train_val_test_split: [0.75, 0.24, 0.01]
\ No newline at end of file
diff --git a/physicsnemo/configs/config_stats_all.yaml b/physicsnemo/configs/config_stats_all.yaml
deleted file mode 100644
index bf7ac89fb86304cf48071dc7856db3da1e4bc057..0000000000000000000000000000000000000000
--- a/physicsnemo/configs/config_stats_all.yaml
+++ /dev/null
@@ -1,65 +0,0 @@
-# ignore_header_test
-# Copyright 2023 Stanford University
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-random_seed: 2
-
-scheduler:
-  lr: 1.E-4
-  lr_decay: 1.E-3
-
-training:
-  epochs: 100
-  
-checkpoints:
-  ckpt_path: "checkpoints"
-  ckpt_name: "config_stats_all"
-
-performance:
-  amp: False
-  jit: False
-
-architecture:
-  processor_size: 5
-  hidden_dim_node_encoder: 64
-  hidden_dim_edge_encoder: 64
-  hidden_dim_processor: 64
-  hidden_dim_node_decoder: 64
-  out_dim: 1
-
-paths:
-  data_dir: /global/cfs/projectdirs/atlas/joshua/hackathon_data/stats_all
-  save_dir: /pscratch/sd/j/joshuaho/physicsnemo/graphs/stats_all
-  training_dir: ./training_stats_all/
-
-datasets:
-  - name: ttH_cp_even
-    load_path: ${paths.data_dir}/ttH_NLO.root
-    label: 0
-  - name: ttH_cp_odd
-    load_path: ${paths.data_dir}/ttH_CPodd.root
-    label: 1
-
-root_dataset:
-  ttree: output
-  type: torch.bfloat16
-  particles: ["jet", "ele", "mu", "ph", "MET"]
-  features: ["pt", "eta", "phi", "energy", "btag", "charge", "node_type"]
-  globals: []
-  weights: ""
-  tracking: []
-  step_size: 81920
-  batch_size: 8192
-  train_val_test_split: [0.75, 0.24, 0.01]
-  prebatch: True
\ No newline at end of file
diff --git a/physicsnemo/configs/tHjb_CP_0_vs_45.yaml b/physicsnemo/configs/tHjb_CP_0_vs_45.yaml
deleted file mode 100644
index 2bc0cfd298a5da976ae5c93b80f99e091b7f0a6c..0000000000000000000000000000000000000000
--- a/physicsnemo/configs/tHjb_CP_0_vs_45.yaml
+++ /dev/null
@@ -1,79 +0,0 @@
-# ignore_header_test
-# Copyright 2023 Stanford University
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-random_seed: 2
-
-scheduler:
-  lr: 1.E-3
-  lr_decay: 1.E-3
-
-training:
-  epochs: 100
-  
-checkpoints:
-  ckpt_path: "checkpoints"
-  ckpt_name: "config"
-
-performance:
-  amp: False
-  jit: False
-
-architecture:
-  processor_size: 8
-  hidden_dim_node_encoder: 128
-  hidden_dim_edge_encoder: 128
-  hidden_dim_processor: 128
-  hidden_dim_node_decoder: 128
-  global_emb_dim: 128
-  out_dim: 1
-
-paths:
-  data_dir: /global/cfs/projectdirs/atlas/joshua/ttHCP/ntuples/v02/preselection/merged_fixed/train/
-  save_dir: /pscratch/sd/j/joshuaho/physicsnemo/ttHCP/graphs/tHjb_CP_0_vs_45/
-  training_dir: ./training_tHjb_CP_0_vs_45/
-
-datasets:
-  - name: tHjb_cp_0_had
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_0_AF3_had_scaled.root
-    label: 0
-  - name: tHjb_cp_0_lep
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_0_AF3_lep_scaled.root
-    label: 0
-  - name: tHjb_cp_45_had
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_45_AF3_had_scaled.root
-    label: 1
-  - name: tHjb_cp_45_lep
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_45_AF3_lep_scaled.root
-    label: 1
-
-root_dataset:
-  ttree: output
-  dtype: torch.bfloat16
-  features:
-    # pt, eta, phi, energy, btag, charge, node_type
-    jet: [m_jet_pt, m_jet_eta, m_jet_phi, CALC_E, m_jet_PCbtag, 0, 0]
-    electron: [m_el_pt, m_el_eta, m_el_phi, CALC_E, 0, m_el_charge, 1]
-    muon: [m_mu_pt, m_mu_eta, m_mu_phi, CALC_E, 0, m_mu_charge, 2]
-    photon: [ph_pt_myy, ph_eta, ph_phi, CALC_E, 0, 0, 3]
-    met: [m_met, 0, m_met_phi, CALC_E, 0, 0, 4]
-  globals: [NUM_NODES]
-  weights: m_weightXlumi
-  tracking: []
-  step_size: 16384
-  batch_size: 16384
-  train_val_test_split: [0.5, 0.25, 0.25]
-  prebatch: 
-    enabled: True
-    chunk_size: 512
\ No newline at end of file
diff --git a/physicsnemo/configs/tHjb_CP_0_vs_90.yaml b/physicsnemo/configs/tHjb_CP_0_vs_90.yaml
deleted file mode 100644
index 55737ff7b99cbd5b8fcceb41cc94e2e115e63333..0000000000000000000000000000000000000000
--- a/physicsnemo/configs/tHjb_CP_0_vs_90.yaml
+++ /dev/null
@@ -1,87 +0,0 @@
-# ignore_header_test
-# Copyright 2023 Stanford University
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-random_seed: 2
-
-scheduler:
-  lr: 1.E-3
-  lr_decay: 1.E-3
-
-training:
-  epochs: 100
-  
-checkpoints:
-  ckpt_path: "checkpoints"
-  ckpt_name: "tHjb_CP_0_vs_90"
-
-performance:
-  amp: False
-  jit: False
-
-architecture:
-  module: models.MeshGraphNet
-  class: MeshGraphNet
-  args:
-    base_gnn:
-      input_dim_nodes: 7
-      input_dim_edges: 3
-      output_dim: 128
-      processor_size: 8
-      hidden_dim_node_encoder: 128
-      hidden_dim_edge_encoder: 128
-      hidden_dim_processor: 128
-      hidden_dim_node_decoder: 128
-    global_emb_dim: 128
-    global_feat_dim: 1
-    out_dim: 1
-
-paths:
-  data_dir: /global/cfs/projectdirs/atlas/joshua/ttHCP/ntuples/v02/preselection/merged_fixed/train/
-  save_dir: /pscratch/sd/j/joshuaho/physicsnemo/ttHCP/graphs/tHjb_CP_0_vs_90/
-  training_dir: ./tHjb_CP_0_vs_90/
-
-datasets:
-  - name: tHjb_cp_0_had
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_0_AF3_had_scaled.root
-    label: 0
-  - name: tHjb_cp_0_lep
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_0_AF3_lep_scaled.root
-    label: 0
-  - name: tHjb_cp_90_had
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_90_AF3_had_scaled.root
-    label: 1
-  - name: tHjb_cp_90_lep
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_90_AF3_lep_scaled.root
-    label: 1
-
-root_dataset:
-  ttree: output
-  dtype: torch.bfloat16
-  features:
-    # pt, eta, phi, energy, btag, charge, node_type
-    jet: [m_jet_pt, m_jet_eta, m_jet_phi, CALC_E, m_jet_PCbtag, 0, 0]
-    electron: [m_el_pt, m_el_eta, m_el_phi, CALC_E, 0, m_el_charge, 1]
-    muon: [m_mu_pt, m_mu_eta, m_mu_phi, CALC_E, 0, m_mu_charge, 2]
-    photon: [ph_pt_myy, ph_eta, ph_phi, CALC_E, 0, 0, 3]
-    met: [m_met, 0, m_met_phi, CALC_E, 0, 0, 4]
-  globals: [NUM_NODES]
-  weights: 1
-  tracking: []
-  step_size: 16384
-  batch_size: 16384
-  train_val_test_split: [0.5, 0.25, 0.25]
-  prebatch: 
-    enabled: True
-    chunk_size: 512
\ No newline at end of file
diff --git a/physicsnemo/configs/tHjb_CP_0_vs_90_edge_network.yaml b/physicsnemo/configs/tHjb_CP_0_vs_90_edge_network.yaml
deleted file mode 100644
index 1fd6cfa0078254c3f817aad2736b0cd864fb672a..0000000000000000000000000000000000000000
--- a/physicsnemo/configs/tHjb_CP_0_vs_90_edge_network.yaml
+++ /dev/null
@@ -1,82 +0,0 @@
-# ignore_header_test
-# Copyright 2023 Stanford University
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-random_seed: 2
-
-scheduler:
-  lr: 1.E-3
-  lr_decay: 1.E-3
-
-training:
-  epochs: 100
-  
-checkpoints:
-  ckpt_path: "checkpoints"
-  ckpt_name: "tHjb_CP_0_vs_90_edge_network"
-
-performance:
-  amp: False
-  jit: False
-
-architecture:
-  module: models.Edge_Network
-  class: Edge_Network
-  args:
-    input_dim_nodes: 7
-    input_dim_edges: 3
-    input_dim_globals: 1
-    hid_size: 64
-    n_layers: 4
-    n_proc_steps: 4
-    out_dim: 1
-
-paths:
-  data_dir: /global/cfs/projectdirs/atlas/joshua/ttHCP/ntuples/v02/preselection/merged_fixed/train/
-  save_dir: /pscratch/sd/j/joshuaho/physicsnemo/ttHCP/graphs/tHjb_CP_0_vs_90/
-  training_dir: ./tHjb_CP_0_vs_90_edge_network/
-
-datasets:
-  - name: tHjb_cp_0_had
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_0_AF3_had_scaled.root
-    label: 0
-  - name: tHjb_cp_0_lep
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_0_AF3_lep_scaled.root
-    label: 0
-  - name: tHjb_cp_90_had
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_90_AF3_had_scaled.root
-    label: 1
-  - name: tHjb_cp_90_lep
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_90_AF3_lep_scaled.root
-    label: 1
-
-root_dataset:
-  ttree: output
-  dtype: torch.bfloat16
-  features:
-    # pt, eta, phi, energy, btag, charge, node_type
-    jet: [m_jet_pt, m_jet_eta, m_jet_phi, CALC_E, m_jet_PCbtag, 0, 0]
-    electron: [m_el_pt, m_el_eta, m_el_phi, CALC_E, 0, m_el_charge, 1]
-    muon: [m_mu_pt, m_mu_eta, m_mu_phi, CALC_E, 0, m_mu_charge, 2]
-    photon: [ph_pt_myy, ph_eta, ph_phi, CALC_E, 0, 0, 3]
-    met: [m_met, 0, m_met_phi, CALC_E, 0, 0, 4]
-  globals: [NUM_NODES]
-  weights: 1
-  tracking: []
-  step_size: 16384
-  batch_size: 16384
-  train_val_test_split: [0.5, 0.25, 0.25]
-  prebatch: 
-    enabled: True
-    chunk_size: 512
\ No newline at end of file
diff --git a/physicsnemo/configs/tHjb_CP_0_vs_90_globals.yaml b/physicsnemo/configs/tHjb_CP_0_vs_90_globals.yaml
deleted file mode 100644
index 546a69e7209db96d5b783eb60d09a0612b13b483..0000000000000000000000000000000000000000
--- a/physicsnemo/configs/tHjb_CP_0_vs_90_globals.yaml
+++ /dev/null
@@ -1,84 +0,0 @@
-# ignore_header_test
-# Copyright 2023 Stanford University
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-random_seed: 2
-
-scheduler:
-  lr: 1.E-3
-  lr_decay: 1.E-3
-
-training:
-  epochs: 100
-  
-checkpoints:
-  ckpt_path: "checkpoints"
-  ckpt_name: "tHjb_CP_0_vs_90_globals"
-
-performance:
-  amp: False
-  jit: False
-
-architecture:
-  base_gnn:
-    input_dim_nodes: 7
-    input_dim_edges: 3
-    output_dim: 128
-    processor_size: 8
-    hidden_dim_node_encoder: 128
-    hidden_dim_edge_encoder: 128
-    hidden_dim_processor: 128
-    hidden_dim_node_decoder: 128
-  global_emb_dim: 128
-  global_feat_dim: 5
-  out_dim: 1
-
-paths:
-  data_dir: /global/cfs/projectdirs/atlas/joshua/ttHCP/ntuples/v02/preselection/merged_fixed/train/
-  save_dir: /pscratch/sd/j/joshuaho/physicsnemo/ttHCP/graphs/tHjb_CP_0_vs_90_globals/
-  training_dir: ./tHjb_CP_0_vs_90_globals/
-
-datasets:
-  - name: tHjb_cp_0_had
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_0_AF3_had_scaled.root
-    label: 0
-  - name: tHjb_cp_0_lep
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_0_AF3_lep_scaled.root
-    label: 0
-  - name: tHjb_cp_90_had
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_90_AF3_had_scaled.root
-    label: 1
-  - name: tHjb_cp_90_lep
-    load_path: ${paths.data_dir}/merged_aMCPy8_tHjb125_CP_90_AF3_lep_scaled.root
-    label: 1
-
-root_dataset:
-  ttree: output
-  dtype: torch.bfloat16
-  features:
-    # pt, eta, phi, energy, btag, charge, node_type
-    jet: [m_jet_pt, m_jet_eta, m_jet_phi, CALC_E, m_jet_PCbtag, 0, 0]
-    electron: [m_el_pt, m_el_eta, m_el_phi, CALC_E, 0, m_el_charge, 1]
-    muon: [m_mu_pt, m_mu_eta, m_mu_phi, CALC_E, 0, m_mu_charge, 2]
-    photon: [ph_pt_myy, ph_eta, ph_phi, CALC_E, 0, 0, 3]
-    met: [m_met, 0, m_met_phi, CALC_E, 0, 0, 4]
-  globals: [NUM_NODES, eta_H, pt_H, eta_recotop1, pT_recotop1]
-  weights: 1
-  tracking: []
-  step_size: 16384
-  batch_size: 16384
-  train_val_test_split: [0.5, 0.25, 0.25]
-  prebatch: 
-    enabled: True
-    chunk_size: 512
\ No newline at end of file
diff --git a/physicsnemo/dataset/Dataset.py b/physicsnemo/dataset/Dataset.py
deleted file mode 100644
index 107f83e80091a2133ef32f8f8c0b20af1096cd94..0000000000000000000000000000000000000000
--- a/physicsnemo/dataset/Dataset.py
+++ /dev/null
@@ -1,243 +0,0 @@
-import os
-import uproot
-import dgl
-import torch
-import numpy as np
-from omegaconf import DictConfig
-from typing import List
-from concurrent.futures import ProcessPoolExecutor, as_completed
-from tqdm import tqdm
-
-from dataset import GraphBuilder
-from dataset import Graphs
-from dataset import Normalization
-
-from dgl.dataloading import GraphDataLoader
-
-class Dataset:
-    def __init__(
-        self,
-        name: str,
-        label: int,
-        load_path: str,
-        save_path: str,
-        dtype: torch.dtype,
-        device: str,
-        cfg: DictConfig
-    ):
-        self.name = name
-        self.label = label
-        self.load_path = load_path
-        self.save_path = save_path
-        self.dtype = dtype
-        self.data = None
-        self.device = device
-
-        self.ttree = cfg.ttree
-        self.features = cfg.features
-        self.weights = cfg.weights
-        self.globals = cfg.globals
-        self.tracking = cfg.tracking
-        self.step_size = cfg.step_size
-        self.batch_size = cfg.batch_size
-
-        self.prebatch = cfg.get('prebatch', {'enabled': False})
-
-        self.train_val_test_split = cfg.train_val_test_split
-        assert np.isclose(np.sum(self.train_val_test_split), 1.0), "train_val_test_split must sum to 1"
-
-        print(f"initializing dataset {name} with dtype {self.dtype}")
-
-    def get_branches(self) -> List[str]:
-        node_branches = [
-            branches
-            for particle in self.features.values()
-            for branches in particle
-            if isinstance(branches, str) and branches not in ("CALC_E", "NUM_NODES")
-        ]
-        global_branches = [x for x in self.globals if isinstance(x, str)]
-        weight_branch = [self.weights] if isinstance(self.weights, str) else []
-        tracking_branches = [x for x in self.tracking if isinstance(x, str)]
-        label_branch = [self.label] if isinstance(self.label, str) else []
-
-        return node_branches + global_branches + weight_branch + tracking_branches + label_branch
-
-    def process(self):
-        branches = self.get_branches()
-        with uproot.open(f"{self.load_path}:{self.ttree}") as tree:
-            available_branches = set(tree.keys())
-            num_entries = tree.num_entries
-
-        print(f"getting branches: {branches}")
-        
-        num_cpus = os.cpu_count()
-        total_chunks = int(np.ceil(num_entries / self.step_size))
-
-        with ProcessPoolExecutor(max_workers=num_cpus) as executor:
-            futures = []
-
-            with tqdm(
-                uproot.iterate(
-                        f"{self.load_path}:{self.ttree}",
-                        expressions=[b for b in branches if b in available_branches],
-                        step_size=self.step_size,
-                        library="ak"
-                    ),
-                desc="loading root file",
-                total=total_chunks,
-                position=0,
-                leave=True
-            ) as pbar:
-                
-                for chunk_id, arrays in enumerate(pbar):
-
-                    cfg = GraphBuilder.ChunkConfig(
-                        name=self.name,
-                        label=self.label,
-                        chunk_id=chunk_id,
-                        batch_size=self.batch_size,
-                        arrays=arrays,
-                        features=self.features,
-                        globals=self.globals,
-                        tracking=self.tracking,
-                        weights=self.weights,
-                        branches=branches,
-                        dtype=self.dtype,
-                        save_path=self.save_path,
-                        prebatch = self.prebatch,
-                    )
-                    
-                    futures.append(executor.submit(GraphBuilder.process_chunk, cfg))
-
-        for idx, future in enumerate(as_completed(futures)):
-            try:
-                future.result()
-            except Exception as e:
-                import traceback
-                print(f"exception in chunk: {idx}")
-                traceback.print_exception(type(e), e, e.__traceback__)
-        return
-    
-    def load(self):
-        with uproot.open(f"{self.load_path}:{self.ttree}") as tree:
-            num_entries = tree.num_entries
-        total_chunks = int(np.ceil(num_entries / self.step_size))
-        
-        chunk_files = [f"{self.save_path}/{self.name}_{chunk_id:04d}.bin" for chunk_id in range(total_chunks)]
-        if not all(os.path.exists(f) for f in chunk_files):
-            print("graphs not found. processing root file...")
-            self.process()
-
-        graph_tuple_list = []
-
-        for chunk_id, f in enumerate(chunk_files):
-            if chunk_id < total_chunks - 1:
-                if (self.prebatch.enabled):
-                    n_graphs = self.step_size // self.prebatch.chunk_size
-                else:
-                    n_graphs = self.step_size
-            else:
-                if (self.prebatch.enabled):
-                    n_graphs = -(-(num_entries - self.step_size * (total_chunks - 1)) // self.prebatch.chunk_size)  # ceil division
-                else:
-                    n_graphs = num_entries - self.step_size * (total_chunks - 1)
-            graph_tuple_list.extend((f, idx) for idx in range(n_graphs))
-        
-        split = self.train_val_test_split
-        n_total = len(graph_tuple_list)
-        n_train = int(split[0] * n_total)
-        n_val = int(split[1] * n_total)
-
-        train_tuples = graph_tuple_list[:n_train]
-        val_tuples   = graph_tuple_list[n_train:n_train + n_val]
-        test_tuples  = graph_tuple_list[n_train + n_val:]
-        return train_tuples, val_tuples, test_tuples
-    
-class GraphTupleDataset:
-    def __init__(self, tuple_list, stats):
-        self.tuple_list = tuple_list
-        self.stats = stats
-        self.cache = {}
-
-    def __len__(self):
-        return len(self.tuple_list)
-
-    def __getitem__(self, idx):
-        f, graph_idx = self.tuple_list[idx]
-        if f in self.cache:
-            g = self.cache[f]
-        else: 
-            g = Graphs.load_graphs(f)
-            g.normalize(self.stats)
-            self.cache[f] = g
-        return g[graph_idx]
-
-    @staticmethod
-    def collate_fn(samples):
-        all_graphs = []
-        all_metadata = {}
-
-        # Initialize keys in all_metadata from the first sample
-        for k in samples[0][1]:
-            all_metadata[k] = []
-
-        for graph, metadata in samples:
-            all_graphs.append(graph)
-            for k, v in metadata.items():
-                all_metadata[k].append(v)
-
-        # Stack or concatenate metadata for each key
-        for k in all_metadata:
-            # If v is a tensor, stack or cat as appropriate
-            # Use torch.cat if v is already [N, ...] (e.g. labels, features)
-            # Use torch.stack if v is scalar or needs new dimension
-            try:
-                all_metadata[k] = torch.cat(all_metadata[k], dim=0)
-            except Exception:
-                all_metadata[k] = torch.stack(all_metadata[k], dim=0)
-
-        batched_graph = dgl.batch(all_graphs)
-        return batched_graph, all_metadata
-        
-def get_dataset(cfg: DictConfig, device):
-
-    all_train = []
-    all_val = []
-    all_test = []
-
-    dtype_str = getattr(cfg.root_dataset, "dtype", "torch.float32")
-    if isinstance(dtype_str, str) and dtype_str.startswith("torch."):
-        dtype = getattr(torch, dtype_str.split(".")[-1], torch.float32)
-    else:
-        dtype = torch.float32
-
-    for ds in cfg.datasets:
-        name = ds['name']
-        load_path = ds.get('load_path', f"{cfg.paths.data_dir}/{name}.root")
-        save_path = ds.get('save_path', f"{cfg.paths.save_dir}/")
-        dataset = Dataset(name, ds.get('label'), load_path, save_path, dtype, device, cfg.root_dataset)
-        train, val, test = dataset.load()
-        all_train.extend(train)
-        all_val.extend(val)
-        all_test.extend(test)
-
-    stats = Normalization.global_stats(f"{cfg.paths.save_dir}/stats/", dtype=dtype)
-
-    train_dataset = GraphTupleDataset(all_train, stats)
-    val_dataset = GraphTupleDataset(all_val, stats)
-    test_dataset = GraphTupleDataset(all_test, stats)
-
-    if cfg.root_dataset.get('prebatch', {}).get('enabled', False):
-        batch_size = cfg.root_dataset.batch_size // cfg.root_dataset.prebatch.chunk_size
-        collate_fn = GraphTupleDataset.collate_fn
-    else:
-        batch_size = cfg.root_dataset.batch_size
-        collate_fn = None
-    
-    train_loader = GraphDataLoader(train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=5, drop_last=False, collate_fn=collate_fn)
-    val_loader   = GraphDataLoader(val_dataset, batch_size=batch_size, shuffle=False, pin_memory=True, num_workers=5, drop_last=False, collate_fn=collate_fn)
-    test_loader  = GraphDataLoader(test_dataset, batch_size=batch_size, shuffle=False, pin_memory=True, num_workers=0, drop_last=False, collate_fn=collate_fn)
-
-    print("all data loaded successfully")
-    print(f"train: {len(train_dataset)}, val: {len(val_dataset)}, test: {len(test_dataset)}")
-    return train_loader, val_loader, test_loader
\ No newline at end of file
diff --git a/physicsnemo/dataset/GraphBuilder.py b/physicsnemo/dataset/GraphBuilder.py
deleted file mode 100644
index 2a21f04e58203177d680a9259af19086f28c9489..0000000000000000000000000000000000000000
--- a/physicsnemo/dataset/GraphBuilder.py
+++ /dev/null
@@ -1,162 +0,0 @@
-import dgl
-import torch
-import numpy as np
-import awkward as ak
-from dataclasses import dataclass
-from typing import List, Any, Union
-
-from dataset.Graphs import Graphs, save_graphs
-from dataset import Normalization
-
-@dataclass
-class ChunkConfig:
-    name:           str
-    label:          Union[str, int]
-    chunk_id:       int
-    batch_size:     int
-    arrays:         List[Any]
-    features:       List[Any]
-    globals:        List[Any]
-    weights:        Union[str, float]
-    tracking:       List[Any]
-    branches:       List[Any]
-    dtype:          torch.dtype
-    save_path:      str
-    prebatch:       dict
-
-def process_chunk(cfg: ChunkConfig):
-    # Collect everything as lists first
-    graph_list = []
-    meta_dict = {
-        'globals': [],
-        'label': [],
-        'weight': [],
-        'tracking': [],
-        'batch_num_nodes': [],
-        'batch_num_edges': [],
-    }
-
-    for i in range(len(cfg.arrays)):
-        g, meta = process_single_entry(cfg, i)
-        graph_list.append(g)
-        for k in meta_dict:
-            meta_dict[k].append(meta[k])
-
-    # Stack all metadata fields into tensors
-    for k in meta_dict:
-        meta_dict[k] = torch.stack(meta_dict[k])
-
-    graphs = Graphs(graphs=graph_list, metadata=meta_dict)
-    Normalization.save_stats(graphs, f"{cfg.save_path}/stats/{cfg.name}_{cfg.chunk_id:04d}.json")
-
-    if getattr(cfg.prebatch, "enabled", False):
-        graphs.shuffle()
-        graphs.batch(cfg.prebatch["chunk_size"])
-
-    save_graphs(graphs, f"{cfg.save_path}/{cfg.name}_{cfg.chunk_id:04d}.bin")
-
-def process_single_entry(cfg, i):
-    # 1) node features
-    node_features: List[torch.Tensor] = []
-
-    for particle, branch_list in cfg.features.items():
-        feature_tensors: List[torch.Tensor] = []
-        for branch in branch_list:
-            if branch == "CALC_E":
-                pT  = feature_tensors[0]
-                eta = feature_tensors[1]
-                val = pT * torch.cosh(eta)
-            elif isinstance(branch, str):
-                arr = cfg.arrays[branch][i]
-                val = torch.from_numpy(ak.to_numpy(arr)).to(cfg.dtype)
-            else:
-                length = feature_tensors[0].shape[0]
-                val = torch.full((length,), float(branch), dtype=cfg.dtype)
-            feature_tensors.append(val)
-
-        if feature_tensors and feature_tensors[0].numel() > 0:
-            block = torch.stack(feature_tensors, dim=1)
-            node_features.append(block)
-
-    node_features = torch.cat(node_features, dim=0) if node_features else torch.empty((0, len(cfg.features)), dtype=cfg.dtype)
-
-    # 2) global features
-    global_feat_list: List[torch.Tensor] = []
-    for b in cfg.globals:
-        if b == "NUM_NODES":
-            global_feat_list.append(torch.tensor([len(node_features)], dtype=cfg.dtype))
-        else:
-            arr = cfg.arrays[b][i]
-            global_feat_list.append(torch.from_numpy(ak.to_numpy(arr)).to(cfg.dtype))
-    global_feat = torch.cat(global_feat_list, dim=0) if global_feat_list else torch.zeros((1,), dtype=cfg.dtype)
-
-    # 3) tracking
-    tracking_list: List[torch.Tensor] = []
-    for b in cfg.tracking:
-        arr = cfg.arrays[b][i]
-        tracking_list.append(torch.from_numpy(ak.to_numpy(arr)).to(cfg.dtype))
-    tracking = torch.cat(tracking_list, dim=0) if tracking_list else torch.zeros((1,), dtype=cfg.dtype)
-
-    # 4) weight
-    weight = float(cfg.arrays[cfg.weights][i]) if isinstance(cfg.weights, str) else cfg.weights
-    weight = torch.tensor(weight, dtype=cfg.dtype)
-
-    # 5) label
-    label = float(cfg.arrays[cfg.label][i]) if isinstance(cfg.label, str) else cfg.label
-    label = torch.tensor(label, dtype=cfg.dtype)
-
-    # 6) make the DGLGraph
-    g = make_graph(node_features, dtype=cfg.dtype)
-
-    # 7) batch_num_nodes and batch_num_edges
-    batch_num_nodes = g.batch_num_nodes()
-    batch_num_edges = g.batch_num_edges()
-
-    meta = {
-        'globals': global_feat,
-        'label': label,
-        'weight': weight,
-        'tracking': tracking,
-        'batch_num_nodes': batch_num_nodes,
-        'batch_num_edges': batch_num_edges,
-    }
-    return g, meta
-
-src_dst_cache = {}
-def get_src_dst(num_nodes):
-    if num_nodes not in src_dst_cache:
-        src, dst = torch.meshgrid(torch.arange(num_nodes), torch.arange(num_nodes), indexing='ij')
-        src_dst_cache[num_nodes] = (src.flatten(), dst.flatten())
-    return src_dst_cache[num_nodes]
-
-@torch.jit.script
-def compute_edge_features(eta, phi, src, dst):
-    deta = eta[src] - eta[dst]
-    dphi = phi[src] - phi[dst]
-    dphi = torch.remainder(dphi + np.pi, 2 * np.pi) - np.pi
-    dR = torch.sqrt(deta ** 2 + dphi ** 2)
-    edge_features = torch.stack([dR, deta, dphi], dim=1)
-    return edge_features
-
-def make_graph(node_features: torch.Tensor, dtype=torch.float32):
-
-    num_nodes = node_features.shape[0]
-    if num_nodes == 0:
-        g = dgl.graph(([], []))
-        g.ndata['features'] = node_features
-        g.edata['features'] = torch.empty((0, 3), dtype=dtype)
-        g.globals = torch.tensor([0], dtype=dtype)
-        return g
-    
-    src, dst = get_src_dst(num_nodes)
-    src = src.flatten()
-    dst = dst.flatten()
-    g = dgl.graph((src, dst))
-    g.ndata['features'] = node_features
-
-    eta = node_features[:, 1]
-    phi = node_features[:, 2]
-    edge_features = compute_edge_features(eta, phi, src, dst)
-    g.edata['features'] = edge_features
-
-    return g
\ No newline at end of file
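Review note on `compute_edge_features` in the file above: the edge features are the pseudorapidity difference, the azimuthal difference wrapped into [-pi, pi), and their quadrature sum dR. The sketch below restates that arithmetic for a single edge in plain Python so the wrap-around can be checked by hand. The names `wrap_dphi` and `edge_features` are illustrative only and are not part of the removed module.

```python
import math


def wrap_dphi(dphi: float) -> float:
    """Map an azimuthal difference into the interval [-pi, pi)."""
    return (dphi + math.pi) % (2 * math.pi) - math.pi


def edge_features(eta_src: float, phi_src: float, eta_dst: float, phi_dst: float):
    """Return (dR, deta, dphi) for one directed edge, in the same order as the diff."""
    deta = eta_src - eta_dst
    dphi = wrap_dphi(phi_src - phi_dst)
    d_r = math.hypot(deta, dphi)
    return d_r, deta, dphi


if __name__ == "__main__":
    # Two objects on either side of the phi = +/-pi seam are close, not ~6 rad apart.
    print(edge_features(0.0, 3.0, 0.0, -3.0))
```

Without the wrap, objects on opposite sides of the phi seam would get an artificially large dR.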
diff --git a/physicsnemo/dataset/Graphs.py b/physicsnemo/dataset/Graphs.py
deleted file mode 100644
index bcbd359e240e2a7778581cf7e3c9af3f5b70ab4d..0000000000000000000000000000000000000000
--- a/physicsnemo/dataset/Graphs.py
+++ /dev/null
@@ -1,88 +0,0 @@
-import dgl
-import torch
-from dataclasses import dataclass, field
-from typing import List, Dict
-
-@dataclass
-class Graphs:
-    graphs: List[dgl.DGLGraph]
-    metadata: Dict[str, torch.Tensor]
-
-    def __len__(self):
-        return len(self.graphs)
-
-    def __getitem__(self, idx):
-        meta = {k: v[idx] for k, v in self.metadata.items()}
-        return self.graphs[idx], meta
-
-    def shuffle(self):
-        idx = torch.randperm(len(self.graphs))
-        self.graphs = [self.graphs[i] for i in idx]
-        for k in self.metadata:
-            self.metadata[k] = self.metadata[k][idx]
-
-    def batch(self, batch_size, node_feature_dim=None, dtype=None):
-        """
-        In-place batching: after this, self.graphs is a list of batched DGLGraphs,
-        and self.metadata[k] is a tensor of shape [num_batches, batch_size, ...].
-        """
-        batched_graphs = []
-        batched_meta = {k: [] for k in self.metadata}
-        N = len(self.graphs)
-
-        # Infer node_feature_dim and dtype if not specified
-        if node_feature_dim is None and N > 0:
-            feats = self.graphs[0].ndata['features']
-            node_feature_dim = feats.shape[1] if feats.ndim > 1 else 1
-        if dtype is None and N > 0:
-            dtype = self.graphs[0].ndata['features'].dtype
-
-        for start in range(0, N, batch_size):
-            end = start + batch_size
-            batch_graphs = self.graphs[start:end]
-            batch_meta = {k: v[start:end] for k, v in self.metadata.items()}
-
-            # Padding if needed
-            pad_count = batch_size - len(batch_graphs)
-            if pad_count > 0:
-                dummy_graph = dgl.graph(([], []))
-                dummy_graph.ndata['features'] = torch.empty((0, node_feature_dim), dtype=dtype)
-                dummy_graph.edata['features'] = torch.empty((0, 3), dtype=dtype)  # assuming 3 edge features
-                batch_graphs += [dummy_graph] * pad_count
-
-                # Pad metadata with zeros
-                for k, v in batch_meta.items():
-                    shape = list(v[0].shape) if len(v) > 0 else []
-                    pad_tensor = torch.zeros([pad_count] + shape, dtype=v.dtype, device=v.device)
-                    batch_meta[k] = torch.cat([v, pad_tensor], dim=0)
-            else:
-                for k, v in batch_meta.items():
-                    batch_meta[k] = torch.stack(v, dim=0) if isinstance(v, list) else v
-
-            batched_graphs.append(dgl.batch(batch_graphs))
-            for k in batched_meta:
-                batched_meta[k].append(batch_meta[k])
-
-        # Now stack along a new axis: [num_batches, batch_size, ...]
-        for k in batched_meta:
-            self.metadata[k] = torch.stack(batched_meta[k], dim=0)
-
-        self.graphs = batched_graphs
-
-    def normalize(self, stats):
-        node_mean, node_std, _ = stats['node']
-        edge_mean, edge_std, _ = stats['edge']
-        for g in self.graphs:
-            g.ndata['features'] = (g.ndata['features'] - node_mean) / node_std
-            g.edata['features'] = (g.edata['features'] - edge_mean) / edge_std
-
-def save_graphs(graphs: Graphs, f: str):
-    meta_to_save = {k: v for k, v in graphs.metadata.items()}
-    dgl.save_graphs(f, graphs.graphs, meta_to_save)
-
-def load_graphs(f: str) -> Graphs:
-    g, meta = dgl.load_graphs(f)
-    for k in meta:
-        if not isinstance(meta[k], torch.Tensor):
-            meta[k] = torch.stack(meta[k])
-    return Graphs(graphs=g, metadata=meta)
\ No newline at end of file
diff --git a/physicsnemo/dataset/Normalization.py b/physicsnemo/dataset/Normalization.py
deleted file mode 100644
index ac168f5f41242af21c1018fc71051cd43f7ba643..0000000000000000000000000000000000000000
--- a/physicsnemo/dataset/Normalization.py
+++ /dev/null
@@ -1,144 +0,0 @@
-import torch
-import json
-import os
-from dataset.Graphs import Graphs
-from typing import List, Dict, Tuple
-
-def combine_feature_stats(chunks: List[Dict]) -> Tuple[torch.Tensor, torch.Tensor, int]:
-    """
-    Combine mean/std/count from multiple chunks using the pairwise (parallel) form of Welford's algorithm.
-    Returns combined mean, std, and total count.
-    """
-    n_total = 0
-    mean_total = None
-    M2_total = None
-
-    for chunk in chunks:
-        n_k = chunk['count']
-        if n_k == 0:
-            continue
-
-        mean_k = torch.tensor(chunk['mean'])
-        std_k = torch.tensor(chunk['std'])
-        M2_k = (std_k ** 2) * n_k
-
-        if n_total == 0:
-            mean_total = mean_k
-            M2_total = M2_k
-            n_total = n_k
-        else:
-            delta = mean_k - mean_total
-            N = n_total + n_k
-            mean_total += delta * (n_k / N)
-            M2_total += M2_k + (delta ** 2) * (n_total * n_k / N)
-            n_total = N
-
-    if n_total == 0:
-        return torch.tensor([]), torch.tensor([]), 0
-
-    std_total = torch.sqrt(M2_total / n_total)
-    return mean_total, std_total, n_total
-
-def global_stats(dirpath: str, dtype: torch.dtype) -> Dict[str, Tuple[torch.Tensor, torch.Tensor, int]]:
-    """
-    Load all JSON stats files in a directory, combine the node and edge stats,
-    and cache the combined stats as `global_stats.json` in `dirpath`.
-    """
-
-    combined_stats_path = os.path.join(dirpath, "global_stats.json")
-
-    if not os.path.exists(combined_stats_path):
-        stats_list = []
-        for fname in os.listdir(dirpath):
-            if fname.endswith('.json'):
-                with open(os.path.join(dirpath, fname), 'r') as f:
-                    stats_list.append(json.load(f))
-
-        node_stats = [s['node'] for s in stats_list]
-        edge_stats = [s['edge'] for s in stats_list]
-
-        combined = {
-            'node': combine_feature_stats(node_stats),
-            'edge': combine_feature_stats(edge_stats),
-        }
-
-        combined_json = {}
-        for key, (mean, std, count) in combined.items():
-            combined_json[key] = {
-                'mean': mean.tolist() if mean.numel() > 0 else [],
-                'std': std.tolist() if std.numel() > 0 else [],
-                'count': count,
-            }
-
-        with open(combined_stats_path, 'w') as f:
-            json.dump(combined_json, f, indent=4)
-
-    with open(combined_stats_path, 'r') as f:
-        combined_json = json.load(f)
-
-    def to_tensor(d):
-        mean = torch.tensor(d['mean'], dtype=dtype) if d['mean'] else torch.tensor([], dtype=dtype)
-        std = torch.tensor(d['std'], dtype=dtype) if d['std'] else torch.tensor([], dtype=dtype)
-        count = d['count']
-        return mean, std, count
-
-    return {
-        'node': to_tensor(combined_json['node']),
-        'edge': to_tensor(combined_json['edge']),
-    }
-
-def compute_stats(feats, eps=1e-6):
-    mean = feats.mean(dim=0)
-    if feats.size(0) > 1:
-        var = ((feats - mean) ** 2).mean(dim=0)
-    else:
-        var = torch.zeros_like(mean)
-    std = torch.sqrt(var)
-    std = torch.where(std < eps, torch.full_like(std, eps), std)
-
-    return mean, std
-
-def save_stats(graphs: 'Graphs', filepath: str, categorical_unique_threshold=50):
-    """
-    Compute and save normalization stats (mean, std, counts) for node and edge features.
-    Categorical features (few unique values) have normalization disabled (mean=0, std=1).
-    """
-    if len(graphs) == 0:
-        raise ValueError("No graphs to compute stats from.")
-
-    # Node and edge features
-    all_node_feats = torch.cat([g.ndata['features'] for g, _ in graphs], dim=0)
-    all_edge_feats = torch.cat([g.edata['features'] for g, _ in graphs], dim=0)
-
-    counts = {
-        'node': all_node_feats.size(0),
-        'edge': all_edge_feats.size(0),
-    }
-
-    node_mean, node_std = compute_stats(all_node_feats)
-    edge_mean, edge_std = compute_stats(all_edge_feats)
-
-    categorical_mask = torch.tensor([
-        torch.unique(all_node_feats[:, i]).numel() < categorical_unique_threshold
-        for i in range(node_mean.size(0))
-    ], dtype=torch.bool)
-    node_mean[categorical_mask] = 0.0
-    node_std[categorical_mask] = 1.0
-
-    stats = {
-        'node': {
-            'mean': node_mean.tolist(),
-            'std': node_std.tolist(),
-            'count': counts['node'],
-        },
-        'edge': {
-            'mean': edge_mean.tolist(),
-            'std': edge_std.tolist(),
-            'count': counts['edge'],
-        },
-    }
-
-    os.makedirs(os.path.dirname(filepath), exist_ok=True)
-
-    with open(filepath, 'w') as f:
-        json.dump(stats, f, indent=4)
\ No newline at end of file
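Review note on `combine_feature_stats` in the file above: per-chunk (count, mean, population std) triples are merged with the pairwise update for the mean and the sum of squared deviations. The scalar sketch below mirrors that update and compares it with statistics computed on the concatenated data. It is a minimal illustration in plain Python, not the removed implementation, and the name `combine_stats` is hypothetical.

```python
import math
from statistics import fmean, pstdev


def combine_stats(chunks):
    """Merge (count, mean, population_std) triples into one (mean, std, count)."""
    n_total, mean_total, m2_total = 0, 0.0, 0.0
    for n_k, mean_k, std_k in chunks:
        if n_k == 0:
            continue  # empty chunks contribute nothing
        m2_k = std_k ** 2 * n_k            # sum of squared deviations within the chunk
        delta = mean_k - mean_total
        n_new = n_total + n_k
        # Between-chunk term uses the counts before the merge.
        m2_total += m2_k + delta ** 2 * n_total * n_k / n_new
        mean_total += delta * n_k / n_new
        n_total = n_new
    std_total = math.sqrt(m2_total / n_total) if n_total else 0.0
    return mean_total, std_total, n_total


if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0]
    merged = combine_stats([(len(a), fmean(a), pstdev(a)), (len(b), fmean(b), pstdev(b))])
    print(merged, (fmean(a + b), pstdev(a + b), len(a + b)))
```

The merged result only matches the full-data statistics when every chunk reports a population (divide by n) standard deviation, which is what `compute_stats` in the diff produces.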
diff --git a/physicsnemo/metrics.py b/physicsnemo/metrics.py
deleted file mode 100644
index 5393c7a78e4ac9b90580901d2b1ba4a46924277f..0000000000000000000000000000000000000000
--- a/physicsnemo/metrics.py
+++ /dev/null
@@ -1,110 +0,0 @@
-import torch
-import numpy as np
-import torch.nn.functional as F
-
-def bce(input, target, weights=None):
-
-    if input.shape != target.shape:
-        if input.shape[-1] == 1 and input.shape[:-1] == target.shape:
-            input = input.squeeze(-1)
-        elif target.shape[-1] == 1 and target.shape[:-1] == input.shape:
-            target = target.squeeze(-1)
-
-    loss = F.binary_cross_entropy_with_logits(input, target, reduction='none')
-    return torch.mean(loss)
-
-def weighted_bce(input, target, weights=None):
-    """
-    Compute a weighted and label-normalized binary cross entropy (BCE) loss.
-
-    For each unique label in the target tensor, the BCE loss is computed and weighted,
-    then normalized by the sum of weights for that label. The final loss is the mean
-    of these per-label normalized losses.
-
-    Args:
-        input (Tensor): Predicted logits of shape (N, ...).
-        target (Tensor): Ground truth labels of shape (N, ...), with discrete label values.
-        weights (Tensor or None): Optional tensor of per-sample weights, same shape as input/target.
-
-    Returns:
-        Tensor: Scalar tensor representing the normalized weighted BCE loss.
-    """
-
-    if input.shape != target.shape:
-        if input.shape[-1] == 1 and input.shape[:-1] == target.shape:
-            input = input.squeeze(-1)
-        elif target.shape[-1] == 1 and target.shape[:-1] == input.shape:
-            target = target.squeeze(-1)
-
-    # Compute per-element BCE loss (no reduction)
-    loss = F.binary_cross_entropy_with_logits(input, target, reduction='none')
-
-    # If weights not provided, use ones
-    if weights is None:
-        weights = torch.ones_like(loss)
-
-    unique_labels = torch.unique(target)
-    normalized_losses = []
-    for label in unique_labels:
-        label_mask = (target == label)  # This will be a bool tensor
-        # Defensive: make sure mask is bool
-        if label_mask.dtype != torch.bool:
-            label_mask = label_mask.bool()
-        label_weights = weights[label_mask]
-        label_losses = loss[label_mask]
-        weight_sum = label_weights.sum()
-        if weight_sum > 0:
-            label_loss = (label_weights * label_losses).sum() / weight_sum
-            normalized_losses.append(label_loss)
-
-    if normalized_losses:
-        return torch.stack(normalized_losses).mean()
-    else:
-        return torch.tensor(0.0, device=input.device)
-    
-
-def roc_auc_score(classes : np.ndarray,
-               predictions : np.ndarray,
-               weights : np.ndarray = None) -> float:
-    """
-    Calculating ROC AUC score as the probability of correct ordering
-    """
-
-    if weights is None:
-        weights = np.ones_like(predictions)
-
-    assert len(classes) == len(predictions) == len(weights)
-    assert classes.ndim == predictions.ndim == weights.ndim == 1
-    class0, class1 = sorted(np.unique(classes))
-
-    data = np.empty(
-            shape=len(classes),
-            dtype=[('c', classes.dtype),
-                   ('p', predictions.dtype),
-                   ('w', weights.dtype)]
-        )
-    data['c'], data['p'], data['w'] = classes, predictions, weights
-
-    data = data[np.argsort(data['c'])]
-    data = data[np.argsort(data['p'], kind='mergesort')] # here we're relying on stability as we need class orders preserved
-
-    correction = 0.
-    # mask1 - bool mask to highlight collision areas
-    # mask2 - bool mask with collision areas' start points
-    mask1 = np.empty(len(data), dtype=bool)
-    mask2 = np.empty(len(data), dtype=bool)
-    mask1[0] = mask2[-1] = False
-    mask1[1:] = data['p'][1:] == data['p'][:-1]
-    if mask1.any():
-        mask2[:-1] = ~mask1[:-1] & mask1[1:]
-        mask1[:-1] |= mask1[1:]
-        ids, = mask2.nonzero()
-        correction = sum([((dsplit['c'] == class0) * dsplit['w'] * msplit).sum() * 
-                          ((dsplit['c'] == class1) * dsplit['w'] * msplit).sum()
-                          for dsplit, msplit in zip(np.split(data, ids), np.split(mask1, ids))]) * 0.5
- 
-    weights_0 = data['w'] * (data['c'] == class0)
-    weights_1 = data['w'] * (data['c'] == class1)
-    cumsum_0 = weights_0.cumsum()
-
-    return ((cumsum_0 * weights_1).sum() - correction) / (weights_1.sum() * cumsum_0[-1])
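Review note on `roc_auc_score` in the file above: its docstring defines the AUC as the probability of correct ordering, and the tie correction counts equal scores as half. A brute-force reference that applies that definition directly can be useful for cross-checking the vectorised version on small inputs. The sketch below assumes labels 0 and 1 and runs in O(n^2); `pairwise_auc` is a hypothetical helper, not part of the removed module.

```python
def pairwise_auc(classes, predictions, weights=None):
    """Weighted probability that a class-1 sample scores above a class-0 sample.

    Ties in the score contribute half of their weight product.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    neg = [(p, w) for c, p, w in zip(classes, predictions, weights) if c == 0]
    pos = [(p, w) for c, p, w in zip(classes, predictions, weights) if c == 1]

    numerator = 0.0
    for p0, w0 in neg:
        for p1, w1 in pos:
            if p1 > p0:
                numerator += w0 * w1
            elif p1 == p0:
                numerator += 0.5 * w0 * w1

    denominator = sum(w for _, w in neg) * sum(w for _, w in pos)
    return numerator / denominator


if __name__ == "__main__":
    print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Because the loop is quadratic, this is only suitable as a test oracle on a few thousand events at most.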
diff --git a/physicsnemo/models/Edge_Network.py b/physicsnemo/models/Edge_Network.py
deleted file mode 100644
index e7cf7591fa03375552f31a3978c2fa91cf1780e7..0000000000000000000000000000000000000000
--- a/physicsnemo/models/Edge_Network.py
+++ /dev/null
@@ -1,72 +0,0 @@
-import torch
-import torch.nn as nn
-import dgl
-
-from models import utils
-
-class Edge_Network(nn.Module):
-    def __init__(self, cfg):
-        super().__init__()
-        hid_size = cfg.hid_size
-        n_layers = cfg.n_layers
-        self.n_proc_steps = cfg.n_proc_steps
-
-        #encoder
-        self.node_encoder = utils.Make_MLP(cfg.input_dim_nodes, hid_size, hid_size, n_layers)
-        self.edge_encoder = utils.Make_MLP(cfg.input_dim_edges, hid_size, hid_size, n_layers)
-        self.global_encoder = utils.Make_MLP(cfg.input_dim_globals, hid_size, hid_size, n_layers)
-
-        #GNN
-        self.node_update = utils.Make_MLP(3*hid_size, hid_size, hid_size, n_layers)
-        self.edge_update = utils.Make_MLP(4*hid_size, hid_size, hid_size, n_layers)
-        self.global_update = utils.Make_MLP(3*hid_size, hid_size, hid_size, n_layers)
-
-        #decoder
-        self.global_decoder = utils.Make_MLP(hid_size, hid_size, hid_size, n_layers)
-        self.classify = nn.Linear(hid_size, cfg.out_dim)
-
-    def forward(self, node_feats, edge_feats, global_feats, batched_graph, metadata={}):
-        # encoders
-        batched_graph.ndata['h'] = self.node_encoder(node_feats)
-        batched_graph.edata['e'] = self.edge_encoder(edge_feats)
-
-        if global_feats.ndim == 3:
-            global_feats = global_feats.view(-1, global_feats.shape[-1])
-        h_global = self.global_encoder(global_feats)
-
-        # message passing
-        for _ in range(self.n_proc_steps):
-            batched_graph.apply_edges(dgl.function.copy_u('h', 'm_u'))
-            batched_graph.apply_edges(utils.copy_v)
-
-            # edge update
-            edge_inputs = torch.cat([
-                batched_graph.edata['e'],
-                batched_graph.edata['m_u'],
-                batched_graph.edata['m_v'],
-                utils.broadcast_global_to_edges(h_global, edge_split=metadata.get("batch_num_edges", None))
-            ], dim=1)
-            batched_graph.edata['e'] = self.edge_update(edge_inputs)
-
-            # node update
-            batched_graph.update_all(dgl.function.copy_e('e', 'm'), dgl.function.sum('m', 'h_e'))
-            node_inputs = torch.cat([
-                batched_graph.ndata['h'],
-                batched_graph.ndata['h_e'],
-                utils.broadcast_global_to_nodes(h_global, node_split=metadata.get("batch_num_nodes", None))
-            ], dim=1)
-            batched_graph.ndata['h'] = self.node_update(node_inputs)
-
-            # global update
-            graph_node_feat = utils.mean_nodes(
-                batched_graph, 'h', node_split=metadata.get("batch_num_nodes", None)
-            )
-            graph_edge_feat = utils.mean_edges(
-                batched_graph, 'e', edge_split=metadata.get("batch_num_edges", None)
-            )
-            h_global = self.global_update(torch.cat([h_global, graph_node_feat, graph_edge_feat], dim=1))
-
-        h_global = self.global_decoder(h_global)
-        out = self.classify(h_global)
-        return out
-
diff --git a/physicsnemo/models/MeshGraphNet.py b/physicsnemo/models/MeshGraphNet.py
deleted file mode 100644
index c20cd3813f1373900f4fdaab25e6b0964a678daf..0000000000000000000000000000000000000000
--- a/physicsnemo/models/MeshGraphNet.py
+++ /dev/null
@@ -1,51 +0,0 @@
-import torch
-import torch.nn as nn
-import dgl
-
-from models import utils
-
-# Import the PhysicsNemo MeshGraphNet model
-from physicsnemo.models.meshgraphnet import MeshGraphNet as PhysicsNemoMeshGraphNet
-
-class MeshGraphNet(nn.Module):
-    def __init__(self, cfg):
-        super().__init__()
-        base_gnn_cfg = cfg.base_gnn
-        self.base_gnn = PhysicsNemoMeshGraphNet(**base_gnn_cfg)
-
-        self.global_mlp = nn.Sequential(
-            nn.Linear(cfg.global_feat_dim, cfg.global_emb_dim),
-            nn.ReLU(),
-        )
-
-        self.mlp = nn.Linear(
-            base_gnn_cfg['output_dim'] + base_gnn_cfg['input_dim_edges'] + cfg.global_emb_dim,
-            cfg.out_dim
-        )
-    
-    def forward(self, node_feats, edge_feats, global_feats, batched_graph, metadata={}):
-        """
-        node_feats: [total_num_nodes, node_feat_dim]
-        edge_feats: [total_num_edges, edge_feat_dim]
-        global_feats: [num_graphs, global_feat_dim]
-        batched_graph: DGLGraph, representing the collection of graphs in a batch
-        metadata: dict, may contain 'batch_num_nodes', 'batch_num_edges', etc.
-        Returns:
-            graph_pred: [num_graphs, out_dim]
-        """
-        node_pred = self.base_gnn(node_feats, edge_feats, batched_graph)
-        batched_graph.ndata['h'] = node_pred
-        batched_graph.edata['e'] = edge_feats
-
-        graph_node_feat = utils.mean_nodes(batched_graph, 'h', node_split=metadata.get("batch_num_nodes", None))
-        graph_edge_feat = utils.mean_edges(batched_graph, 'e', edge_split=metadata.get("batch_num_edges", None))
-
-        # Flatten global_feats if needed
-        if global_feats.ndim == 3:
-            global_feats = global_feats.view(-1, global_feats.shape[-1])
-        global_emb = self.global_mlp(global_feats)  # [num_graphs, global_emb_dim]
-
-        combined_feat = torch.cat([graph_node_feat, graph_edge_feat, global_emb], dim=-1)
-        graph_pred = self.mlp(combined_feat)
-        return graph_pred
-    
diff --git a/physicsnemo/models/utils.py b/physicsnemo/models/utils.py
deleted file mode 100644
index 2823e5397f5a518836351106cbbc9fa884338d4e..0000000000000000000000000000000000000000
--- a/physicsnemo/models/utils.py
+++ /dev/null
@@ -1,135 +0,0 @@
-import torch
-import torch.nn as nn
-import dgl
-
-def mean_nodes(batched_graph, feat_key='h', op='mean', node_split=None):
-    """
-    Aggregates node features per disjoint graph in a batched DGLGraph.
-
-    Args:
-        batched_graph: DGLGraph
-        feat_key: str, node feature key
-        op: 'mean', 'sum', or 'max'
-        node_split: 1D tensor or list of ints (num nodes per graph)
-
-    Returns:
-        Tensor of shape [num_graphs, node_feat_dim]
-    """
-    h = batched_graph.ndata[feat_key]
-    if node_split is None or len(node_split) == 0:
-        if op == 'mean':
-            return dgl.mean_nodes(batched_graph, feat_key)
-        elif op == 'sum':
-            return dgl.sum_nodes(batched_graph, feat_key)
-        elif op == 'max':
-            return dgl.max_nodes(batched_graph, feat_key)
-        else:
-            raise ValueError(f"Unknown op: {op}")
-    else:
-        # Ensure node_split is a flat list of ints
-        if isinstance(node_split, torch.Tensor):
-            splits = node_split.view(-1).tolist()
-        else:
-            splits = [int(x) for x in node_split]
-        chunks = torch.split(h, splits, dim=0)
-        if op == 'mean':
-            out = torch.stack([chunk.mean(0) if chunk.shape[0] > 0 else torch.zeros_like(h[0]) for chunk in chunks])
-        elif op == 'sum':
-            out = torch.stack([chunk.sum(0) if chunk.shape[0] > 0 else torch.zeros_like(h[0]) for chunk in chunks])
-        elif op == 'max':
-            out = torch.stack([chunk.max(0).values if chunk.shape[0] > 0 else torch.zeros_like(h[0]) for chunk in chunks])
-        else:
-            raise ValueError(f"Unknown op: {op}")
-        return out
-    
-def mean_edges(batched_graph, feat_key='e', op='mean', edge_split=None):
-    """
-    Aggregates edge features per disjoint graph in a batched DGLGraph.
-
-    Args:
-        batched_graph: DGLGraph
-        feat_key: str, edge feature key
-        op: 'mean', 'sum', or 'max'
-        edge_split: 1D tensor or list of ints (num edges per graph)
-
-    Returns:
-        Tensor of shape [num_graphs, edge_feat_dim]
-    """
-    e = batched_graph.edata[feat_key]
-    if edge_split is None or len(edge_split) == 0:
-        if op == 'mean':
-            return dgl.mean_edges(batched_graph, feat_key)
-        elif op == 'sum':
-            return dgl.sum_edges(batched_graph, feat_key)
-        elif op == 'max':
-            return dgl.max_edges(batched_graph, feat_key)
-        else:
-            raise ValueError(f"Unknown op: {op}")
-    else:
-        # Ensure edge_split is a flat list of ints
-        if isinstance(edge_split, torch.Tensor):
-            splits = edge_split.view(-1).tolist()
-        else:
-            splits = [int(x) for x in edge_split]
-        chunks = torch.split(e, splits, dim=0)
-        if op == 'mean':
-            out = torch.stack([chunk.mean(0) if chunk.shape[0] > 0 else torch.zeros_like(e[0]) for chunk in chunks])
-        elif op == 'sum':
-            out = torch.stack([chunk.sum(0) if chunk.shape[0] > 0 else torch.zeros_like(e[0]) for chunk in chunks])
-        elif op == 'max':
-            out = torch.stack([chunk.max(0).values if chunk.shape[0] > 0 else torch.zeros_like(e[0]) for chunk in chunks])
-        else:
-            raise ValueError(f"Unknown op: {op}")
-        return out
-    
-def Make_SLP(in_size, out_size, activation = nn.ReLU, dropout = 0):
-    layers = []
-    layers.append(nn.Linear(in_size, out_size))
-    layers.append(activation())
-    layers.append(nn.Dropout(dropout))
-    return layers
-
-def Make_MLP(in_size, hid_size, out_size, n_layers, activation = nn.ReLU, dropout = 0):
-    layers = []
-    if n_layers > 1:
-        layers += Make_SLP(in_size, hid_size, activation, dropout)
-        for i in range(n_layers-2):
-            layers += Make_SLP(hid_size, hid_size, activation, dropout)
-        layers += Make_SLP(hid_size, out_size, activation, dropout)
-    else:
-        layers += Make_SLP(in_size, out_size, activation, dropout)
-    layers.append(torch.nn.LayerNorm(out_size))
-    return nn.Sequential(*layers)
-
-def broadcast_global_to_nodes(globals, node_split):
-    """
-    globals: [num_graphs, global_dim]
-    node_split: list/1D tensor of length num_graphs, number of nodes per graph
-    Returns: [total_num_nodes, global_dim]
-    """
-    if node_split is None:
-        raise ValueError("node_split must be provided")
-    if not torch.is_tensor(node_split):
-        node_split = torch.tensor(node_split, dtype=torch.long, device=globals.device)
-    else:
-        node_split = node_split.to(device=globals.device, dtype=torch.long)
-    node_split = node_split.flatten()
-    return torch.repeat_interleave(globals, node_split, dim=0)
-
-def broadcast_global_to_edges(globals, edge_split):
-    """
-    globals: [num_graphs, global_dim] (on CUDA or CPU)
-    edge_split: list/1D tensor of length num_graphs, number of edges per graph (CPU or CUDA)
-    Returns: [total_num_edges, global_dim]
-    """
-    if edge_split is None:
-        raise ValueError("edge_split must be provided")
-    if not torch.is_tensor(edge_split):
-        edge_split = torch.tensor(edge_split, dtype=torch.long, device=globals.device)
-    else:
-        edge_split = edge_split.to(device=globals.device, dtype=torch.long)
-    edge_split = edge_split.flatten()
-    return torch.repeat_interleave(globals, edge_split, dim=0)
-
-def copy_v(edges):
-    return {'m_v': edges.dst['h']}
\ No newline at end of file
diff --git a/physicsnemo/setup/Dockerfile b/physicsnemo/setup/Dockerfile
deleted file mode 100755
index 89cf6d2f530e3a8c186e82c43f22e3f428f4a65f..0000000000000000000000000000000000000000
--- a/physicsnemo/setup/Dockerfile
+++ /dev/null
@@ -1,23 +0,0 @@
-FROM nvcr.io/nvidia/physicsnemo/physicsnemo:25.06
-
-WORKDIR /global/cfs/projectdirs/atlas/joshua/GNN4Colliders
-
-LABEL maintainer.name="Joshua Ho"
-LABEL maintainer.email="ho22joshua@berkeley.edu"
-
-ENV LANG=C.UTF-8
-
-# Install system dependencies: vim, OpenMPI, and build tools
-RUN apt-get update -qq \
- && apt-get install -y --no-install-recommends \
-    wget lsb-release gnupg software-properties-common \
-    vim \
-    g++-11 gcc-11 libstdc++-11-dev \
-    openmpi-bin openmpi-common libopenmpi-dev \
- && rm -rf /var/lib/apt/lists/*
-
-# Install Python packages: mpi4py, jupyter, and uproot
-RUN pip install --no-cache-dir mpi4py jupyter uproot
-
-# (Optional) Expose Jupyter port
-EXPOSE 8888
diff --git a/physicsnemo/setup/build_image.sh b/physicsnemo/setup/build_image.sh
deleted file mode 100755
index f9d1ecbb9bda5da752e97ed025f51d454ea6566b..0000000000000000000000000000000000000000
--- a/physicsnemo/setup/build_image.sh
+++ /dev/null
@@ -1,4 +0,0 @@
-tag="$1"
-echo "$tag"
-podman-hpc build -t "joshuaho/nemo:$tag" --platform linux/amd64 .
-podman-hpc migrate "joshuaho/nemo:$tag"
diff --git a/physicsnemo/train.py b/physicsnemo/train.py
deleted file mode 100644
index 0801dd4bff1b49aa0fa2561ea9187f8d096495bd..0000000000000000000000000000000000000000
--- a/physicsnemo/train.py
+++ /dev/null
@@ -1,246 +0,0 @@
-import time, os
-
-start = time.time()
-import torch
-from torch.nn.parallel import DistributedDataParallel
-from dgl.dataloading import GraphDataLoader
-from torch.amp import GradScaler
-import numpy as np
-import hydra
-from omegaconf import DictConfig
-from physicsnemo.launch.logging import (
-    PythonLogger,
-    RankZeroLoggingWrapper,
-)
-from physicsnemo.launch.utils import load_checkpoint, save_checkpoint
-from physicsnemo.distributed.manager import DistributedManager
-
-import json
-from tqdm import tqdm
-import random
-
-import models.MeshGraphNet as MeshGraphNet
-from dataset.Dataset import get_dataset
-import metrics
-
-import utils
-
-class MGNTrainer:
-    def __init__(self, logger, cfg, dist):
-        # set device
-        self.device = dist.device
-        logger.info(f"Using {self.device} device")
-
-        start = time.time()
-        self.trainloader, self.valloader, self.testloader = get_dataset(cfg, self.device)
-        print(f"total time loading dataset: {time.time() - start:.2f} seconds")
-
-        dtype_str = getattr(cfg.root_dataset, "dtype", "torch.float32")
-        if isinstance(dtype_str, str) and dtype_str.startswith("torch."):
-            self.dtype = getattr(torch, dtype_str.split(".")[-1], torch.float32)
-        else:
-            self.dtype = torch.float32
-
-        self.model = utils.build_from_module(cfg.architecture)
-        self.model = self.model.to(dtype=self.dtype, device=self.device)
-        # num_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
-        # print(f"Number of trainable parameters: {num_params}")
-
-        if cfg.performance.jit:
-            self.model = torch.jit.script(self.model).to(self.device)
-        else:
-            self.model = self.model.to(self.device)
-
-        # instantiate optimizer, scheduler, and gradient scaler
-        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=cfg.scheduler.lr)
-        self.scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
-            self.optimizer,
-            T_max=cfg.training.epochs,
-            eta_min=cfg.scheduler.lr * cfg.scheduler.lr_decay,
-        )
-        self.scaler = GradScaler('cuda')
-
-        # load checkpoint
-        self.epoch_init = load_checkpoint(
-            os.path.join(cfg.checkpoints.ckpt_path, cfg.checkpoints.ckpt_name),
-            models=self.model,
-            optimizer=self.optimizer,
-            scheduler=self.scheduler,
-            scaler=self.scaler,
-            device=self.device,
-        )
-
-        self.cfg = cfg
-
-    def backward(self, loss):
-        """
-        Perform backward pass and optimizer step (through the GradScaler when AMP is enabled).
-
-        Arguments:
-            loss: loss value.
-
-        """
-        # backward pass
-        if self.cfg.performance.amp:
-            self.scaler.scale(loss).backward()
-            self.scaler.step(self.optimizer)
-            self.scaler.update()
-        else:
-            loss.backward()
-            self.optimizer.step()
-
-    def train(self, graph, metadata):
-        """
-        Perform one optimization step on one batch of graphs.
-
-        Arguments:
-            graph: batched DGLGraph with "features" stored in ndata and edata.
-            metadata: dict holding the 'globals', 'label', and 'weight' tensors
-                for the batch; it is also passed through to the model.
-
-        Returns:
-            loss: detached weighted BCE loss for the batch.
-
-        """
-        graph = graph.to(self.device, non_blocking=True)
-        globals = metadata['globals'].to(self.device, non_blocking=True)
-        label = metadata['label'].to(self.device, non_blocking=True)
-        weight =  metadata['weight'].to(self.device, non_blocking=True)
-
-        self.optimizer.zero_grad()
-        pred = self.model(graph.ndata["features"], graph.edata["features"], globals, graph, metadata)
-        loss = metrics.weighted_bce(pred, label, weights=weight)
-        self.backward(loss)
-        return loss.detach()
-
-    @torch.no_grad()
-    def eval(self):
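-        """
-        Evaluate the model on the full validation loader.
-
-        Predictions, labels, and weights are accumulated over all batches
-        before the metrics are computed.
-
-        Returns:
-            loss (Tensor): Weighted BCE loss over the validation set (scalar).
-            auc (float): ROC AUC, or NaN if it cannot be computed.
-        """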
-        predictions = []
-        labels = []
-        weights = []
-
-        for graph, metadata in self.valloader:
-            
-            graph = graph.to(self.device, non_blocking=True)
-            globals = metadata['globals'].to(self.device, non_blocking=True)
-            label = metadata['label'].to(self.device, non_blocking=True)
-            weight =  metadata['weight'].to(self.device, non_blocking=True)
-            
-            pred = self.model(graph.ndata["features"], graph.edata["features"], globals, graph, metadata)
-            predictions.append(pred)
-            labels.append(label)
-            weights.append(weight)     
-
-        predictions = torch.cat(predictions, dim=0)
-        labels = torch.cat(labels, dim=0)
-        weights = torch.cat(weights, dim=0)
-
-        loss = metrics.weighted_bce(predictions, labels, weights=weights)
-        
-        # Convert logits to probabilities
-        prob = torch.sigmoid(predictions)
-
-        # Flatten to 1D arrays
-        prob_flat = prob.detach().to(torch.float32).cpu().numpy().flatten()
-        labels_flat = labels.detach().to(torch.float32).cpu().numpy().flatten()
-
-        # Calculate AUC
-        try:
-            auc = metrics.roc_auc_score(labels_flat, prob_flat)
-        except ValueError:
-            auc = float('nan')  # Not enough classes present for AUC
-
-        return loss, auc
-
-@hydra.main(version_base=None, config_path="./configs/", config_name="tHjb_CP_0_vs_45")
-def do_training(cfg: DictConfig):
-    """
-    Perform training over all graphs in the dataset.
-
-    Arguments:
-        cfg: Dictionary of parameters.
-
-    """
-    random.seed(cfg.random_seed)
-    np.random.seed(cfg.random_seed)
-    torch.manual_seed(cfg.random_seed)
-
-    # initialize distributed manager
-    DistributedManager.initialize()
-    dist = DistributedManager()
-
-    # initialize loggers
-    os.makedirs(cfg.checkpoints.ckpt_path, exist_ok=True)
-    logger = PythonLogger("main")
-    logger.file_logging(os.path.join(cfg.checkpoints.ckpt_path, "train.log"))
-
-    # initialize trainer
-    trainer = MGNTrainer(logger, cfg, dist)
-
-    if dist.distributed:
-        ddps = torch.cuda.Stream()
-        with torch.cuda.stream(ddps):
-            trainer.model = DistributedDataParallel(
-                trainer.model,
-                device_ids=[dist.local_rank],  # Set the device_id to be
-                                               # the local rank of this process on
-                                               # this node
-                output_device=dist.device,
-                broadcast_buffers=dist.broadcast_buffers,
-                find_unused_parameters=dist.find_unused_parameters,
-            )
-        torch.cuda.current_stream().wait_stream(ddps)
-
-    # training loop
-    start = time.time()
-    logger.info("Training started...")
-    for epoch in range(trainer.epoch_init, cfg.training.epochs):
-
-        # Training
-        train_loss = []
-        for graph, metadata in tqdm(trainer.trainloader, desc=f"epoch {epoch} training"):
-            trainer.model.train()
-            loss = trainer.train(graph, metadata)
-            train_loss.append(loss.item())
-        
-        val_loss, val_auc = trainer.eval()
-
-        train_loss = torch.tensor(train_loss).mean()
-
-        logger.info(
-            f"epoch: {epoch}, loss: {train_loss:10.3e}, val_loss: {val_loss:10.3e}, val_auc = {val_auc:10.3e}, time per epoch: {(time.time()-start):10.3e}"
-        )
-
-        # save checkpoint
-        save_checkpoint(
-            os.path.join(cfg.checkpoints.ckpt_path, cfg.checkpoints.ckpt_name),
-            models=trainer.model,
-            optimizer=trainer.optimizer,
-            scheduler=trainer.scheduler,
-            scaler=trainer.scaler,
-            epoch=epoch,
-        )
-        start = time.time()
-        trainer.scheduler.step()
-    logger.info("Training completed!")
-
-
-"""
-    Perform training over all graphs in the dataset.
-
-    Arguments:
-        cfg: Dictionary of parameters.
-
-    """
-if __name__ == "__main__":
-    do_training()
\ No newline at end of file
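Note on the removed `physicsnemo/train.py`: the trainer pairs `CosineAnnealingLR` with `T_max=cfg.training.epochs` and `eta_min=cfg.scheduler.lr * cfg.scheduler.lr_decay`, and steps the scheduler once per epoch. The sketch below reproduces the closed-form cosine schedule from the PyTorch documentation in plain Python, to show what learning rates that configuration implies. The function name and the numeric values (`1e-3`, `0.01`, 10 epochs) are illustrative assumptions, not values taken from the repository configs.

```python
import math


def cosine_annealed_lr(epoch, base_lr, lr_decay, total_epochs):
    """Closed-form cosine annealing, mirroring how train.py sets eta_min."""
    eta_min = base_lr * lr_decay  # floor of the schedule, as in train.py
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


# Hypothetical settings: lr=1e-3, lr_decay=0.01, 10 epochs.
schedule = [cosine_annealed_lr(e, 1e-3, 0.01, 10) for e in range(11)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.3e}")
```

Under these assumptions the rate starts at the base value and decays smoothly to `lr * lr_decay` at the final epoch, so `lr_decay` is best read as the ratio between the last and first learning rates.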
diff --git a/physicsnemo/utils.py b/physicsnemo/utils.py
deleted file mode 100644
index bbbdb423dd993b19ca69a3f5c648ad2bd49b46c9..0000000000000000000000000000000000000000
--- a/physicsnemo/utils.py
+++ /dev/null
@@ -1,11 +0,0 @@
-import importlib
-from types import SimpleNamespace
-
-def build_from_module(cfg):
-    modname = cfg['module']
-    classname = cfg['class']
-    args = cfg['args']
-    module = importlib.import_module(modname)
-    model_cls = getattr(module, classname)
-    cfg_obj = SimpleNamespace(**args)
-    return model_cls(cfg_obj)
\ No newline at end of file
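Note on the removed `physicsnemo/utils.py`: `build_from_module` resolves a class from a `module`/`class`/`args` mapping and hands the arguments to the constructor as a single `SimpleNamespace`. The sketch below restates that helper and exercises it against a stand-in module. `demo_models`, `TinyModel`, and the argument values are hypothetical names introduced only for illustration; the real configs point at modules such as `models.MeshGraphNet`.

```python
import importlib
import sys
import types
from types import SimpleNamespace


def build_from_module(cfg):
    """Same logic as the helper in the diff: import, look up, instantiate."""
    module = importlib.import_module(cfg["module"])
    model_cls = getattr(module, cfg["class"])
    return model_cls(SimpleNamespace(**cfg["args"]))


class TinyModel:
    """Hypothetical model that reads its settings via attribute access."""

    def __init__(self, cfg):
        self.hid_size = cfg.hid_size
        self.base_gnn = cfg.base_gnn  # nested values stay plain dicts


# Register a hypothetical in-memory module so import_module can find it.
demo_module = types.ModuleType("demo_models")
demo_module.TinyModel = TinyModel
sys.modules["demo_models"] = demo_module

architecture = {
    "module": "demo_models",
    "class": "TinyModel",
    "args": {"hid_size": 64, "base_gnn": {"output_dim": 8}},
}
model = build_from_module(architecture)
print(model.hid_size, model.base_gnn["output_dim"])
```

Only the top level of `args` becomes attributes. Nested mappings keep dictionary access, which matches how the removed `MeshGraphNet` wrapper reads `base_gnn_cfg['output_dim']`.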
diff --git a/root_gnn_dgl/.codex/skills/root-gnn-dgl-data-preparation/SKILL.md b/root_gnn_dgl/.codex/skills/root-gnn-dgl-data-preparation/SKILL.md
deleted file mode 100644
index 282f6d611abe5fdc5d0ce550b282fe9ffcc63f73..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/.codex/skills/root-gnn-dgl-data-preparation/SKILL.md
+++ /dev/null
@@ -1,78 +0,0 @@
----
-name: root-gnn-dgl-data-preparation
-description: Use when the user asks to build graphs or prebatched .bin files, rerun failed data prep, verify missing graph chunks, or use scripts/prep_data.py, scripts/check_dataset_files.py, or jobs/prep_data/prep_data.sh before training in root_gnn_dgl.
----
-
-# root-gnn-dgl-data-preparation
-
-Use this skill for graph creation and graph-readiness checks before training.
-
-## Primary entry point
-
-Run from the repo root:
-
-```bash
-python scripts/prep_data.py --config <config.yaml> --dataset <dataset_name> --chunk <chunk_index>
-```
-
-Use `--shuffle_mode` when you want preshuffled, prebatched graph files for training:
-
-```bash
-python scripts/prep_data.py --config <config.yaml> --dataset <dataset_name> --shuffle_mode --chunk <chunk_index>
-```
-
-## Recommended run pattern
-
-- Read dataset names from `config["Datasets"]`.
-- Read the chunk count from each dataset's `args.chunks`.
-- When matching the repo's own wrappers, run chunk `0` once before the full loop. Both `run_demo.sh` and `jobs/prep_data/prep_data.sh` do this.
-
-Example pattern:
-
-```bash
-python scripts/prep_data.py --config configs/stats_100K/pretraining_multiclass.yaml --dataset ttH --shuffle_mode --chunk 0
-for i in 0 1 2; do
-  python scripts/prep_data.py --config configs/stats_100K/pretraining_multiclass.yaml --dataset ttH --shuffle_mode --chunk "$i"
-done
-```
-
-Use the repo wrapper when you want the standard loop:
-
-```bash
-bash jobs/prep_data/prep_data.sh <config> <dataset> <chunks> [extra_args]
-```
-
-## Important flags and caveats
-
-- `--shuffle_mode` creates the prebatched artifacts consumed by `scripts/training_script.py --preshuffle`.
-- `--drop_last` is inverted by the CLI definition: passing the flag sets `drop_last=False`.
-- Dataset configs can override training batch size during prebatching with a dataset-level `batch_size`.
-- The README says a `list index out of range` after graph saving is currently expected in some prep runs. Treat it as non-fatal if the output `.bin` files were written successfully.
-
-## Audit the outputs
-
-Run from the repo root:
-
-```bash
-python scripts/check_dataset_files.py --configs stats_100K/pretraining_multiclass.yaml
-```
-
-The `--configs` argument must be a comma-separated list of paths relative to `configs/`.
-
-This checker validates:
-
-- chunk files named `${dataset}_${chunk}.bin`
-- prebatched fold files named `${dataset}_prebatched_padded_${i}_n_${n_folds}_f_${foldlist}.bin`
-
-Use rerun mode to repair missing artifacts:
-
-```bash
-python scripts/check_dataset_files.py --configs stats_100K/pretraining_multiclass.yaml --rerun
-```
-
-Treat data prep as ready only if:
-
-- every required chunk file exists
-- every required prebatched fold file exists when training will use `--preshuffle`
-- save paths match the config
-- any post-save `IndexError` did not prevent the files from being written
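Note on the `--drop_last` caveat in the data-preparation skill above: one common way a flag ends up inverted is an `argparse` option declared with `action="store_false"`. The sketch below shows that pattern with the standard library. It is an assumption about how such a flag could be defined, not a copy of the definition in `scripts/prep_data.py`.

```python
import argparse

# Assumed definition: store_false makes the default True and the flag False.
parser = argparse.ArgumentParser(description="inverted-flag illustration")
parser.add_argument("--drop_last", action="store_false")

without_flag = parser.parse_args([])
with_flag = parser.parse_args(["--drop_last"])

print("no flag      -> drop_last =", without_flag.drop_last)
print("--drop_last  -> drop_last =", with_flag.drop_last)
```

With a definition like this, omitting the flag keeps `drop_last=True`, so a caller who wants incomplete final batches dropped must leave `--drop_last` off the command line.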
diff --git a/root_gnn_dgl/.codex/skills/root-gnn-dgl-env-setup/SKILL.md b/root_gnn_dgl/.codex/skills/root-gnn-dgl-env-setup/SKILL.md
deleted file mode 100644
index 019d6834098b8dd05141b02591c50dac6f017d84..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/.codex/skills/root-gnn-dgl-env-setup/SKILL.md
+++ /dev/null
@@ -1,63 +0,0 @@
----
-name: root-gnn-dgl-env-setup
-description: Use when the user asks to set up or validate the root_gnn_dgl runtime, such as running conda setup from setup/environment.yml, setup/test_setup.py, import ROOT checks, podman-hpc image setup, or the interactive allocation scripts in jobs/ before data prep, training, or inference.
----
-
-# root-gnn-dgl-env-setup
-
-Use this skill from the repo root before any stage run.
-
-## Choose the runtime
-
-- Use the conda environment in `setup/environment.yml` for `scripts/inference.py`. The repo README says inference needs PyROOT, and the podman image does not include ROOT.
-- Use the `podman-hpc` image `joshuaho/pytorch:1.0` for training on Perlmutter when you want the containerized path.
-- For parallel inference, make sure `mpi4py` is available. The README notes it is not listed in the conda environment requirements; `setup/Dockerfile` installs it in the container image.
-
-## Conda path
-
-```bash
-cd setup
-conda env create -f environment.yml
-conda activate pytorch
-cd ..
-python setup/test_setup.py
-python -c "import ROOT"
-```
-
-Run `setup/test_setup.py` from the repo root. It appends the current working directory to `sys.path` and checks imports in `scripts`, `root_gnn_base`, and `models`.
-
-## Podman path
-
-```bash
-podman-hpc pull docker.io/joshuaho/pytorch:1.0
-```
-
-Or build locally:
-
-```bash
-cd setup
-source build_image.sh
-```
-
-The helper `setup/launch_image.sh` mounts `/pscratch/sd/j/joshuaho/` and `/global/cfs/projectdirs/atlas/joshua/` into the container and then runs the given entrypoint.
-
-## Interactive allocations
-
-- `source jobs/interactive.sh` for one shared interactive GPU node.
-- `source jobs/cpu.sh` for a CPU allocation that suits large prep loops.
-- `source jobs/salloc.sh` for a multi-node GPU allocation.
-
-## Runtime audit
-
-- Use `nvidia-smi` before training on login or interactive nodes to confirm memory availability.
-- Validate the basic repo imports with `python setup/test_setup.py`.
-- Validate PyROOT explicitly with `python -c "import ROOT"`.
-- For parallel inference, also validate `python -c "from mpi4py import MPI"`.
-- Some repo scripts hard-code NERSC-style paths under `/global/cfs/projectdirs/atlas/joshua/...`. If running elsewhere, fix those paths before assuming the environment is valid.
-
-Treat environment setup as passing only if:
-
-- imports succeed
-- the chosen runtime matches the stage you plan to run
-- required site-specific paths exist
-- GPU or CPU resources are actually available for the intended stage
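Note on the runtime audit in the environment-setup skill above: the import checks can be gathered into one availability report before a stage is launched. The sketch below uses `importlib.util.find_spec`, which only reports whether a top-level module can be located and does not execute it, so it complements the explicit `python -c "import ROOT"` check and does not replace it. The helper name `missing_modules` and the module list are illustrative assumptions.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules named in the skill; adjust the list to the stage being run.
required = ["ROOT", "mpi4py", "torch", "dgl"]
missing = missing_modules(required)
if missing:
    print("missing modules:", ", ".join(missing))
else:
    print("all required modules were found")
```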
diff --git a/root_gnn_dgl/.codex/skills/root-gnn-dgl-inference/SKILL.md b/root_gnn_dgl/.codex/skills/root-gnn-dgl-inference/SKILL.md
deleted file mode 100644
index c53e33fdf8e8c61ad38b14623aa98388c8bd2a65..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/.codex/skills/root-gnn-dgl-inference/SKILL.md
+++ /dev/null
@@ -1,112 +0,0 @@
----
-name: root-gnn-dgl-inference
-description: Use when the user asks to score ROOT files, add GNN score branches, launch MPI inference, or verify inference outputs in root_gnn_dgl, including scripts/inference.py, jobs/inference/run_inference.py, ROOT branch checks, and NPZ or ROOT output audits.
----
-
-# root-gnn-dgl-inference
-
-Use this skill to score ntuples with trained models and verify the outputs.
-
-## Environment
-
-- Run inference in an environment with PyROOT available.
-- The repo README says the conda environment is required for inference because the podman training image does not include ROOT.
-
-## Entry point
-
-Run from the repo root:
-
-```bash
-python scripts/inference.py \
-  --target <input.root> \
-  --destination <output.root> \
-  --config <config.yaml> \
-  --branch_name <branch_name> \
-  --chunks 1 \
-  --chunkno 0 \
-  --write
-```
-
-## Multi-model inference
-
-The script accepts multiple configs and multiple branch names in one run:
-
-```bash
-python scripts/inference.py \
-  --target <input.root> \
-  --destination <output.root> \
-  --config config_a.yaml config_b.yaml \
-  --branch_name score_a score_b \
-  --chunks 1 \
-  --chunkno 0 \
-  --write
-```
-
-The number of configs and branch names must match.
-
-## Checkpoint selection
-
-- With the default `--ckpt -1`, the script selects the best epoch from `training.log` using `--var` and `--mode`.
-- Use `--ckpt <n>` to force a specific checkpoint.
-- If `--destination` is omitted, the script writes under `<Training_Directory>/inference/`.
-
-## Output modes
-
-- `--write` creates a new ROOT file and adds score branches.
-- Without `--write`, the script saves an `.npz` bundle containing scores, labels, and tracking info.
-- Use `--clobber` when reusing an existing destination path.
-
-## Parallel inference
-
-The repo includes an MPI wrapper:
-
-```bash
-mpirun -np <num_ranks> python jobs/inference/run_inference.py
-```
-
-That script is campaign-specific, with hard-coded file lists, configs, branches, and destinations. Patch or copy it before using it for a different campaign.
-
-## Repo-specific behavior
-
-- The first config's first dataset is used as the template dataset. The script rewrites `raw_dir`, `file_names`, `save_dir`, `chunks`, `process_chunks`, and optionally `tree_name` at runtime.
-- Pass `--tree <name>` if the ROOT tree name differs from the config default.
-- Chunked inference writes per-chunk outputs; merging those outputs is a separate step.
-- The script prepends a hard-coded repo path near the top of the file. If imports fail outside the expected NERSC layout, fix that path first.
-
-## Audit the outputs
-
-Start with basic runtime evidence if you have a log:
-
-- `Writing to file`
-- `Input entries:`
-- `Output entries:`
-- `Wrote scores to`
-- absence of `Traceback`
-
-For ROOT outputs, prefer `uproot`:
-
-```bash
-python - <<'PY'
-import numpy as np
-import uproot
-path = "<output.root>"
-branches = ["<score_branch>"]
-tree = uproot.open(path)["output"]
-print("entries", tree.num_entries)
-for branch in branches:
-    arr = tree[branch].array(library="np")
-    print(branch, len(arr), np.isnan(arr).sum(), float(np.nanmin(arr)), float(np.nanmax(arr)), float(np.nanmean(arr)))
-PY
-```
-
-Treat inference as valid only if:
-
-- the destination file exists
-- every requested score branch exists
-- output entry count matches the input tree
-- score arrays contain no NaNs
-- score arrays are not constant
-
-For multi-model inference, every branch must exist and branch statistics should usually differ unless the models are intentionally identical.
-
-Without `--write`, inspect the `.npz` keys `scores`, `labels`, and `tracking_info` and verify array lengths and NaN counts.
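Note on the validity criteria in the inference skill above: once a score branch has been loaded into a plain sequence (for example with the `uproot` snippet in that skill), the entry-count, NaN, and constant-score checks can be expressed as one function. The sketch below is a dependency-free illustration; `audit_scores` and its messages are hypothetical and not part of the repository.

```python
import math


def audit_scores(scores, expected_entries):
    """Return a list of problems found in one score branch (empty if none)."""
    problems = []
    if len(scores) != expected_entries:
        problems.append(
            f"entry count mismatch: {len(scores)} != {expected_entries}"
        )
    nan_count = sum(1 for s in scores if math.isnan(s))
    if nan_count:
        problems.append(f"{nan_count} NaN scores")
    finite = [s for s in scores if not math.isnan(s)]
    if finite and min(finite) == max(finite):
        problems.append("scores are constant")
    return problems


print(audit_scores([0.12, 0.87, 0.45], expected_entries=3))
print(audit_scores([0.5, 0.5, float("nan")], expected_entries=4))
```

An empty list means the branch passes these three checks. Whether the destination file and every requested branch exist still has to be verified separately.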
diff --git a/root_gnn_dgl/.codex/skills/root-gnn-dgl-plotting/SKILL.md b/root_gnn_dgl/.codex/skills/root-gnn-dgl-plotting/SKILL.md
deleted file mode 100644
index de2c79e2e5a0b919d0f4df2f6bb12d53a9183a43..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/.codex/skills/root-gnn-dgl-plotting/SKILL.md
+++ /dev/null
@@ -1,80 +0,0 @@
----
-name: root-gnn-dgl-plotting
-description: Use when the user asks to plot training curves, regenerate training.png, build the sweep PDF from plotting/training_performance.py, compare training runs, or extract Loss, Accuracy, Test_Loss, Test_AUC, and timing information from training.log files in root_gnn_dgl.
----
-
-# root-gnn-dgl-plotting
-
-Use this skill when the task is about plots or metrics derived from `training.log`.
-
-## Single-run plot regeneration
-
-The training script can regenerate the per-run PNG directly:
-
-```bash
-python scripts/training_script.py --config <config.yaml> --plot
-```
-
-That uses `root_gnn_base.utils.read_log()` and `root_gnn_base.utils.plot_log()` to rebuild `training.png` from `Training_Directory/training.log`.
-
-`plot_log()` produces a 2x2 figure with:
-
-- cumulative time in seconds
-- train and test loss
-- accuracy
-- test AUC
-
-Be aware that `plot_log()` fixes the accuracy axis to `(0.44, 0.56)`, which may be too narrow for some runs.
-
-## Sweep-level plotting
-
-Use the dedicated plotting script when the user wants a PDF comparing shipped sweeps:
-
-```bash
-python plotting/training_performance.py
-python plotting/training_performance.py --output <output.pdf>
-```
-
-The script currently plots two config groups:
-
-- `pretraining`
-- `higgs_production`
-
-It writes one PDF page per group and resolves each run's `Training_Directory` from its config.
-
-## What the plotting script reads from training.log
-
-`plotting/training_performance.py` parses rows that start with `Epoch` and extracts:
-
-- `Epoch`
-- `Loss`
-- `Accuracy`
-- `Test_Loss`
-- `Test_AUC`
-- `Time`
-
-It also computes cumulative time in hours.
-
-Baseline runs are drawn as lines. Non-baseline sweep variants are drawn as point clouds and labeled by the parameter change relative to the baseline.
-
-## Audit the log before plotting
-
-Treat plotting input as valid only if:
-
-- `training.log` exists
-- it contains at least one valid `Epoch ...` row
-- parsed metric arrays are finite
-- the referenced `Training_Directory` actually exists
-
-If the plotting script fails, inspect the log directly:
-
-```bash
-sed -n '1,40p' <training_dir>/training.log
-tail -n 25 <training_dir>/training.log
-```
-
-## When to use which plot path
-
-- Use `--plot` on `scripts/training_script.py` when the user wants the repo's standard per-run `training.png`.
-- Use `plotting/training_performance.py` when the user wants a cross-run PDF for the built-in sweep groups.
-- If the user wants custom comparisons outside the built-in groups, start from the parsing logic in `plotting/training_performance.py` and the metric schema in `training.log`.
diff --git a/root_gnn_dgl/.codex/skills/root-gnn-dgl-training/SKILL.md b/root_gnn_dgl/.codex/skills/root-gnn-dgl-training/SKILL.md
deleted file mode 100644
index 74af75563b59296aa0be102a2b2075614280aacf..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/.codex/skills/root-gnn-dgl-training/SKILL.md
+++ /dev/null
@@ -1,123 +0,0 @@
----
-name: root-gnn-dgl-training
-description: Use when the user asks to train, finetune, resume, submit, queue-check, log-check, or validate training runs in root_gnn_dgl, including scripts/training_script.py, sqs, jobs/slurm/<JOBID>.out, training.log, checkpoints, and scripts/generate_multiclass_finetuning_configs.py.
----
-
-# root-gnn-dgl-training
-
-Use this skill for any training stage in the repo, from launch through monitoring and artifact review.
-
-## Run locally
-
-Run from the repo root:
-
-```bash
-python scripts/training_script.py --config <config.yaml> --preshuffle --nocompile --lazy
-```
-
-This matches the README, `run_demo.sh`, and the podman job wrappers.
-
-## Why these defaults
-
-- `--preshuffle` uses the saved prebatched graph files created during data prep.
-- `--nocompile` is recommended by the README because compiled mode requires padded graphs at prep time.
-- `--lazy` matches the common dataset classes used by the shipped configs.
-
-## Common runtime modes
-
-- `--restart` starts from scratch instead of resuming from the last checkpoint.
-- Without `--restart`, the script resumes from the last `model_epoch_<n>.pt` it finds in `Training_Directory`.
-- `--evaluate <epoch>` skips training and evaluates a specific checkpoint.
-- `--plot` regenerates `training.png` from `training.log`.
-- `--directory <suffix>` appends a suffix to `Training_Directory`.
-- `--cpu`, `--multigpu`, `--multinode`, `--statistics`, `--seed`, and `--abs` are available when needed.
-
-## Submit on Perlmutter
-
-Prefer the podman path:
-
-```bash
-sbatch jobs/training/podman/run_job.sh <config>
-```
-
-The repo also ships `jobs/training/podman/submit.sh` for hard-coded sweeps and a conda-based Slurm path, but the podman wrapper is the more reliable default here.
-
-For distributed training, pass `--multinode` and launch under an environment that sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`.
-
-## Preconditions
-
-- If you use `--preshuffle`, run data preparation first and confirm the graph artifacts exist.
-- For finetuning configs, verify that `Model.args.pretraining_path` points to an existing checkpoint before launching training.
-- For multinode runs, pass `--multinode` and launch under the relevant distributed job environment.
-
-## Monitor queue and logs
-
-Check queue state:
-
-```bash
-sqs -u "$USER"
-sqs -u "$USER" | rg "<pattern>"
-```
-
-Useful interpretations:
-
-- `PD` means pending
-- `R` means running
-- `START_TIME N/A` with reason `Priority` means queued normally, not broken
-
-Once a job has a `JOBID`, inspect:
-
-```bash
-sed -n '1,80p' jobs/slurm/<JOBID>.out
-tail -n 80 jobs/slurm/<JOBID>.out
-rg -n "Traceback|Error|Exception|Epoch|Epoch Done|Early Termination|Done" jobs/slurm/<JOBID>.out
-```
-
-Healthy training logs usually show:
-
-- the `Executing: python -u ... scripts/training_script.py ...` line
-- dataset cache loads
-- repeated `Epoch ... | LR ... | Loss ... | Accuracy ... | Test_Loss ... | Test_AUC ... | Time ... s`
-- `Epoch Done.`
-- `Num batches trained = ...`
-- valid completion via `Done`, sometimes after `Early Termination at Epoch ...`
-
-Early stopping is a normal completion mode in this repo.
-
-## Audit training artifacts
-
-Training writes into `Training_Directory`:
-
-- `config.yaml`
-- `model_epoch_<n>.pt`
-- `model_epoch_<n>.npz`
-- `training.log`
-- `training.png`
-
-Primary checks:
-
-```bash
-sed -n '1,40p' <training_dir>/training.log
-tail -n 25 <training_dir>/training.log
-```
-
-Treat the run as healthy only if:
-
-- epoch numbers increase monotonically
-- `Loss`, `Test_Loss`, and `Test_AUC` stay finite
-- the latest logged epoch has a matching `model_epoch_<n>.pt`
-- the run produced real epoch rows rather than stopping before training started
-
-If `training.log` grows but checkpoints stop appearing, suspect a save-path or filesystem issue.
-
-Use `python plotting/training_performance.py` or the `root-gnn-dgl-plotting` skill when you want consolidated sweep-level PDFs instead of a single-run `training.png`.
-
-## Generate finetuning configs
-
-Use this when you want to derive `configs/higgs_production/multiclass_finetuning/*.yaml` from completed multiclass pretraining runs:
-
-```bash
-python scripts/generate_multiclass_finetuning_configs.py
-```
-
-Before using the generated configs, verify that the chosen best-epoch checkpoint paths still exist.
diff --git a/root_gnn_dgl/.codex/skills/root-gnn-dgl-workflow/SKILL.md b/root_gnn_dgl/.codex/skills/root-gnn-dgl-workflow/SKILL.md
deleted file mode 100644
index 26d40bebf5c1751025c8ad421883e70a6a892b16..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/.codex/skills/root-gnn-dgl-workflow/SKILL.md
+++ /dev/null
@@ -1,68 +0,0 @@
----
-name: root-gnn-dgl-workflow
-description: Use when the user asks to run or review the full root_gnn_dgl workflow, such as run_demo.sh, a full pretraining-to-finetuning-to-inference campaign, or a stage-by-stage pass, warning, fail audit across environment setup, data preparation, training, inference, and outputs.
----
-
-# root-gnn-dgl-workflow
-
-Use this skill when the user wants an end-to-end workflow rather than a single isolated stage.
-
-## Shipped demo
-
-Run from the repo root:
-
-```bash
-source run_demo.sh
-```
-
-The demo does:
-
-1. graph prep for multiclass pretraining
-2. multiclass pretraining
-3. graph prep for binary classification
-4. from-scratch binary training
-5. finetuned binary training
-6. inference with two output score branches
-
-## Before running the workflow
-
-- Check GPU availability with `nvidia-smi` or request an interactive node with `jobs/interactive.sh`.
-- Confirm the target data and output directories in `run_demo.sh` exist and are writable.
-- Confirm `configs/stats_100K/finetuning_ttH_CP_even_vs_odd.yaml` points at the checkpoint you actually want to finetune from.
-
-## Workflow audit order
-
-When reviewing a campaign, check stages in this order:
-
-1. environment readiness
-2. data-prep outputs
-3. training submission and queue state
-4. training logs and checkpoints
-5. inference outputs
-
-Use the retained stage skills for each check:
-
-- `root-gnn-dgl-env-setup`
-- `root-gnn-dgl-data-preparation`
-- `root-gnn-dgl-training`
-- `root-gnn-dgl-inference`
-
-## Output style
-
-Return a short status for each stage:
-
-- `pass`: evidence is consistent with a healthy stage
-- `warning`: stage likely worked but still needs a follow-up check
-- `fail`: concrete blocker or corrupted or missing artifact found
-
-Repo-specific workflow blockers:
-
-- pending jobs in `sqs` with `Priority` are waiting, not failed
-- missing prebatched `.bin` files block `--preshuffle` training
-- missing `pretraining_path` blocks finetuning
-- ROOT outputs without the requested score branches are inference failures even if the file exists
-
-## When to adapt instead of sourcing the demo
-
-- If you only want one stage, call the underlying prep, training, or inference script directly.
-- If dataset paths, branch names, or chunk counts differ, copy the command pattern from `run_demo.sh` and adjust the values instead of editing the demo in place.
diff --git a/root_gnn_dgl/README.md b/root_gnn_dgl/README.md
index 22b296eae1c9a617a23ee14982a615389978546d..7337e9ed4e171ccfae13100f48c23a4afbe3ac89 100644
--- a/root_gnn_dgl/README.md
+++ b/root_gnn_dgl/README.md
@@ -1,68 +1,53 @@
-
 # root_gnn_dgl
 
-Pretrained DGL-based ROOT graph neural network. 
+## Data Directory (for Hackathon)
+`/global/cfs/projectdirs/trn007/lbl_atlas/data/`
+
+* `stats_all`: full statistics sample, ~10M events per process
+* `stats_100K`: reduced statistics sample, 100K events per process
+* `processed_graphs`: graphs that have already been processed
+* `scores`: a copy of the samples along with the GNN scores for each event
 
-Pretrained model location: `/global/cfs/projectdirs/atlas/joshua/Pretrained_GNN/multiclass_pretrained_model_12/`
-To use the pretrained model, take a look at a finetuning config in `configs`.
-Replace `pretraining_path:` with `/global/cfs/projectdirs/atlas/joshua/Pretrained_GNN/multiclass_pretrained_model_12/model_epoch_71.pt`.
+## Environment Setup
 
-## Overview
-- Stable release with pretrained model weights.
+The environment dependencies for this project are listed in `setup/environment.yml`. Follow the steps below to set up the environment:
 
-## Conda setup
+### Step 1: Install Conda
+If you don’t already have Conda installed, install either Miniconda (lightweight) or Anaconda (full version):
 
-The conda environment is required for the inference step: applying the GNN onto root files and saving GNN scores as an additional branch. This is because the infereces script uses pyROOT.
+- **Miniconda**: Download and install from [https://docs.conda.io/en/latest/miniconda.html](https://docs.conda.io/en/latest/miniconda.html).
+- **Anaconda**: Download and install from [https://www.anaconda.com/products/distribution](https://www.anaconda.com/products/distribution).
 
+### Step 2: Clone the Repository
+Clone this repository to your local machine:
 ```bash
-cd setup
-conda env create -f environment.yml
-conda activate pytorch
-cd ..
-python setup/test_setup.py 
+git init
+git lfs install
+git clone https://huggingface.co/HWresearch/GNN4Colliders
 ```
-
-##  Container Setup (Podman-HPC)
-
-- NERSC Perlmutter environment with `podman-hpc` available.
-- Access to `joshuaho/pytorch:1.0` on Docker Hub [https://hub.docker.com/r/joshuaho/pytorch](https://hub.docker.com/r/joshuaho/pytorch)
-
-The inference step requires the conda environment, since the container does not contain ROOT.
-
-### Pull the Prebuilt Image
-
+To clone without downloading the large files (fetching only their Git LFS pointers), use:
 ```bash
-podman-hpc pull docker.io/joshuaho/pytorch:1.0
+GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/HWresearch/GNN4Colliders
 ```
 
-Or, you can build your own container here:
-
+### Step 3: Create the Conda Environment
+Use the `environment.yml` file to create the Conda environment:
 ```bash
-cd setup
-source build_image.sh
+conda env create -f setup/environment.yml -n <environment_name>
 ```
 
-Run the image and mount the paths you need, replaceing `<source>` with source directory path and `<target>` with the path for when you are inside the container.
+### Step 4: Activate the Environment
+Activate the newly created environment:
 ```bash
+conda activate <environment_name>
+```
+Replace `<environment_name>` with the name you chose when creating the environment in Step 3.
 
-podman-hpc run \
-  -it \
-  --mount type=bind,source=<source>,target=<target> \
-  --rm \
-  --network host \
-  --gpu \
-  --userns keep-id \
-  --shm-size=32g \
-  joshuaho/pytorch:1.0
-  ```
-
-### Test the Environment
+### Step 5: Test the Environment
 Run the `setup/test_setup.py` script to confirm that all packages needed for training are properly set up.
 ```bash
 python setup/test_setup.py
 ```
-
-
 ## Running the Demo
 The demo training is an example of our ML workflow, consisting of training a pretrained model, then finetuning it for an analysis task, while also training a model for the analysis task from scratch. The config files for the demo are located in the directory `configs/stats_100K/`. The demo can be run on a login node on Perlmutter (if enough GPU memory is availble).
 
@@ -104,8 +89,6 @@ dgl.save_graphs(str(graph_path).replace('.bin', f'_{self.process_chunks[i]}.bin'
 IndexError: list index out of range
 ``` 
 
-To make sure you have all the necessary graphs to train, you can use the `scripts/check_dataset_files.py` script to ensure all graphs are properly processed. Using the `--rerun` runtime arguement will tell the script to automically re-processes any missing files.
-
 ## Training
 Training is run by `scripts/training_script`. `--preshuffle` tells it to use the preshuffled and batched graphs rather than shuffling and batching on the fly, and `--restart` can be used to force the training to start from the beginning rather than from the last available checkpoint.
 
@@ -122,16 +105,16 @@ Inference is done by `scripts/inference.py`. This script applies the model defin
 
 ```bash
 python scripts/inference.py \
-    --target "/global/cfs/projectdirs/atlas/joshua/gnn_data/stats_100K/ttH_NLO.root" \
-    --destination "/global/cfs/projectdirs/atlas/joshua/gnn_data/scores/stats_100K/ttH_NLO.root" \
+    --target "/global/cfs/projectdirs/trn007/lbl_atlas/data/stats_100K/ttH_NLO.root" \
+    --destination "/global/cfs/projectdirs/trn007/lbl_atlas/data/scores/stats_100K/ttH_NLO.root" \
     --config "configs/stats_100K/finetuning_ttH_CP_even_vs_odd.yaml" \
-    --branch_name "GNN_Score" \
     --chunks 1 \
     --chunkno 0 \
-    --write
+    --write \
+    --branch 'GNN_Score'
 ```
 
-You can also input a list as the `--config` and the `--branch_name` to simultaneously apply multiple models onto the same set of samples. An example on how to do this in shell script is in the `run_demo.sh` file.
+You can also pass a list to both `--config` and `--branch` to apply multiple models to the same set of samples simultaneously. An example of how to do this in a shell script is in the `run_demo.sh` file.
 
 ## Running Jobs + Parallelization
 
diff --git a/root_gnn_dgl/configs/attention/ttH_CP_even_vs_odd.yaml b/root_gnn_dgl/configs/attention/ttH_CP_even_vs_odd.yaml
new file mode 100755
index 0000000000000000000000000000000000000000..dc124be431b366eed524c31ac6fc11c2325f4f42
--- /dev/null
+++ b/root_gnn_dgl/configs/attention/ttH_CP_even_vs_odd.yaml
@@ -0,0 +1,58 @@
+Training_Name: ttH_CP_even_vs_odd
+Training_Directory: trainings/attention/ttH_CP_even_vs_odd
+Model:
+  module: models.GCN
+  class: Attention_Edge_Network
+  args:
+    hid_size: 64
+    in_size: 7
+    out_size: 1
+    n_layers: 4
+    n_proc_steps: 4
+    dropout: 0
+    num_heads: 2
+Training:
+  epochs: 500
+  batch_size: 1024
+  learning_rate: 0.0001
+  gamma: 0.99
+Datasets:
+  ttH_CP_even: &dataset_defn
+    module: root_gnn_base.dataset
+    class: LazyDataset
+    shuffle_chunks: 3
+    batch_size: 1024
+    padding_mode: NODE
+    args: &dataset_args
+      name: ttH_CP_even
+      label: 0
+      # weight_var: weight
+      chunks: 3
+      buffer_size: 2
+      file_names: ttH_NLO.root
+      tree_name: output
+      fold_var: Number
+      raw_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/stats_100K/
+      save_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/processed_graphs/attention/ttH_CP_even_vs_odd/
+      node_branch_names:
+        - [jet_pt, ele_pt, mu_pt, ph_pt, MET_met]
+        - [jet_eta, ele_eta, mu_eta, ph_eta, 0]
+        - [jet_phi, ele_phi, mu_phi, ph_phi, MET_phi]
+        - CALC_E
+        - [jet_btag, 0, 0, 0, 0]
+        - [0, ele_charge, mu_charge, 0, 0]
+        - NODE_TYPE
+      node_branch_types: [vector, vector, vector, vector, single]
+      node_feature_scales: [1e-1, 1, 1, 1e-1, 1, 1, 1]
+    folding:
+      n_folds: 4
+      test: [0]
+      # validation: 1
+      train: [1, 2, 3]
+  ttH_CP_odd:
+    <<: *dataset_defn
+    args:
+      <<: *dataset_args
+      name: ttH_CP_odd
+      label: 1
+      file_names: ttH_CPodd.root
diff --git a/root_gnn_dgl/configs/stats_100K/finetuning_ttH_CP_even_vs_odd.yaml b/root_gnn_dgl/configs/stats_100K/finetuning_ttH_CP_even_vs_odd.yaml
index 3f0330ec370d3405e8bcc9408f23458122b7c375..ae56b014189afbba4886f421dda9aad12f6d467c 100755
--- a/root_gnn_dgl/configs/stats_100K/finetuning_ttH_CP_even_vs_odd.yaml
+++ b/root_gnn_dgl/configs/stats_100K/finetuning_ttH_CP_even_vs_odd.yaml
@@ -23,7 +23,7 @@ Model:
 Training:
   epochs: 500
   batch_size: 1024
-  learning_rate: 0.00001
+  learning_rate: 0.0001
   gamma: 0.99
 Datasets:
   ttH_CP_even: &dataset_defn
@@ -41,8 +41,8 @@ Datasets:
       file_names: ttH_NLO.root
       tree_name: output
       fold_var: Number
-      raw_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/stats_100K/
-      save_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/processed_graphs/stats_100K/ttH_CP_even_vs_odd/
+      raw_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/stats_100K/
+      save_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/processed_graphs/stats_100K/ttH_CP_even_vs_odd/
       node_branch_names:
         - [jet_pt, ele_pt, mu_pt, ph_pt, MET_met]
         - [jet_eta, ele_eta, mu_eta, ph_eta, 0]
diff --git a/root_gnn_dgl/configs/stats_100K/pretraining_multiclass.yaml b/root_gnn_dgl/configs/stats_100K/pretraining_multiclass.yaml
index 7e2d52139473a1b36ce2a67263489d8f8b5c4a04..c182c8588431ebd1016178fc7f8b9c1dda543a0b 100644
--- a/root_gnn_dgl/configs/stats_100K/pretraining_multiclass.yaml
+++ b/root_gnn_dgl/configs/stats_100K/pretraining_multiclass.yaml
@@ -38,8 +38,8 @@ Datasets:
       file_names: ttH_NLO_inc.root
       tree_name: output
       fold_var: Number
-      raw_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/stats_100K/
-      save_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/processed_graphs/stats_100K/pretraining_multiclass/
+      raw_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/stats_100K/
+      save_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/processed_graphs/stats_100K/pretraining_multiclass/
       node_branch_names:
         - [jet_pt, ele_pt, mu_pt, ph_pt, MET_met]
         - [jet_eta, ele_eta, mu_eta, ph_eta, 0]
@@ -94,7 +94,7 @@ Datasets:
     <<: *dataset_defn
     args: 
       <<: *dataset_args
-      name: ttyy
+      name: ttyy_ch
       label: 6
       file_names: 'ttyy.root'
   tttt:
diff --git a/root_gnn_dgl/configs/stats_100K/ttH_CP_even_vs_odd.yaml b/root_gnn_dgl/configs/stats_100K/ttH_CP_even_vs_odd.yaml
index bb9db04baf543be9b8f3fa7b8d1f4fbce7f6e1a5..5fcb903b08a68a8640bf65ee4aab0c81a9d91d66 100755
--- a/root_gnn_dgl/configs/stats_100K/ttH_CP_even_vs_odd.yaml
+++ b/root_gnn_dgl/configs/stats_100K/ttH_CP_even_vs_odd.yaml
@@ -31,8 +31,8 @@ Datasets:
       file_names: ttH_NLO.root
       tree_name: output
       fold_var: Number
-      raw_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/stats_100K/
-      save_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/processed_graphs/stats_100K/ttH_CP_even_vs_odd/
+      raw_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/stats_100K/
+      save_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/processed_graphs/stats_100K/ttH_CP_even_vs_odd/
       node_branch_names:
         - [jet_pt, ele_pt, mu_pt, ph_pt, MET_met]
         - [jet_eta, ele_eta, mu_eta, ph_eta, 0]
diff --git a/root_gnn_dgl/configs/stats_all/finetuning_ttH_CP_even_vs_odd.yaml b/root_gnn_dgl/configs/stats_all/finetuning_ttH_CP_even_vs_odd.yaml
index 2ce7178d9441e2ed4dc67cd85bc22bfa6c9756b3..4ea8a0f4e8f4c1fb2467574a7f3730a7cf0768e9 100755
--- a/root_gnn_dgl/configs/stats_all/finetuning_ttH_CP_even_vs_odd.yaml
+++ b/root_gnn_dgl/configs/stats_all/finetuning_ttH_CP_even_vs_odd.yaml
@@ -20,11 +20,6 @@ Model:
     n_layers: 4
     n_proc_steps: 4
     dropout: 0
-Training:
-  epochs: 500
-  batch_size: 1024
-  learning_rate: 0.00001
-  gamma: 0.99
 Datasets:
   ttH_CP_even: &dataset_defn
     module: root_gnn_base.dataset
@@ -41,8 +36,8 @@ Datasets:
       file_names: ttH_NLO.root
       tree_name: output
       fold_var: Number
-      raw_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/stats_all/
-      save_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/processed_graphs/stats_all/ttH_CP_even_vs_odd/
+      raw_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/stats_all/
+      save_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/processed_graphs/stats_all/ttH_CP_even_vs_odd/
       node_branch_names:
         - [jet_pt, ele_pt, mu_pt, ph_pt, MET_met]
         - [jet_eta, ele_eta, mu_eta, ph_eta, 0]
diff --git a/root_gnn_dgl/configs/stats_all/pretraining_multiclass.yaml b/root_gnn_dgl/configs/stats_all/pretraining_multiclass.yaml
index 5353fb4bb6fb119b621239b0f745b74f1281d700..8cc480f9f3956ee05dcc033f89ed05d217658de2 100644
--- a/root_gnn_dgl/configs/stats_all/pretraining_multiclass.yaml
+++ b/root_gnn_dgl/configs/stats_all/pretraining_multiclass.yaml
@@ -38,8 +38,8 @@ Datasets:
       file_names: ttH_NLO_inc.root
       tree_name: output
       fold_var: Number
-      raw_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/stats_all/
-      save_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/processed_graphs/stats_all/pretraining_multiclass/
+      raw_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/stats_all/
+      save_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/processed_graphs/stats_all/pretraining_multiclass/
       node_branch_names:
         - [jet_pt, ele_pt, mu_pt, ph_pt, MET_met]
         - [jet_eta, ele_eta, mu_eta, ph_eta, 0]
diff --git a/root_gnn_dgl/configs/stats_all/ttH_CP_even_vs_odd.yaml b/root_gnn_dgl/configs/stats_all/ttH_CP_even_vs_odd.yaml
index d5fd37c92aef5fd6f3db634f0793e40aae814f67..0d074acab903084ce8895a47ba231f28732f9094 100755
--- a/root_gnn_dgl/configs/stats_all/ttH_CP_even_vs_odd.yaml
+++ b/root_gnn_dgl/configs/stats_all/ttH_CP_even_vs_odd.yaml
@@ -31,8 +31,8 @@ Datasets:
       file_names: ttH_NLO.root
       tree_name: output
       fold_var: Number
-      raw_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/stats_all/
-      save_dir: /global/cfs/projectdirs/atlas/joshua/gnn_data/processed_graphs/stats_all/ttH_CP_even_vs_odd/
+      raw_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/stats_all/
+      save_dir: /global/cfs/projectdirs/trn007/lbl_atlas/data/processed_graphs/stats_all/ttH_CP_even_vs_odd/
       node_branch_names:
         - [jet_pt, ele_pt, mu_pt, ph_pt, MET_met]
         - [jet_eta, ele_eta, mu_eta, ph_eta, 0]
diff --git a/root_gnn_dgl/jobs/cpu.sh b/root_gnn_dgl/jobs/cpu.sh
index 71a4f508a866cadd4de28a9f55df29422354d62a..ae17a41d66445d21ba5ccfb636ccda113b921eb6 100644
--- a/root_gnn_dgl/jobs/cpu.sh
+++ b/root_gnn_dgl/jobs/cpu.sh
@@ -1 +1 @@
-salloc --nodes=1 --ntasks=64 --cpus-per-task=1 --qos=interactive --time=04:00:00 --constraint=cpu --account=atlas
+salloc --nodes=1 --ntasks=64 --cpus-per-task=1 --qos=interactive --time=04:00:00 --constraint=cpu --account=trn007
diff --git a/root_gnn_dgl/jobs/interactive.sh b/root_gnn_dgl/jobs/interactive.sh
index 4ba4ad1657811203952376dad4b32d2cbabbb596..8197dd547c8fd7292172146af10d734a68ddaadb 100644
--- a/root_gnn_dgl/jobs/interactive.sh
+++ b/root_gnn_dgl/jobs/interactive.sh
@@ -1 +1 @@
-salloc --nodes 1 --qos shared_interactive --time 04:00:00 --constraint gpu --account=atlas --gres=gpu:1
+salloc --nodes 1 --qos shared_interactive --time 04:00:00 --constraint gpu --account=trn007 --gres=gpu:1
diff --git a/root_gnn_dgl/jobs/prep_data/run_processing.py b/root_gnn_dgl/jobs/prep_data/run_processing.py
index abfe81d0a74f6fa3efb90b6ef1df894866aba52a..3cf70ffaf38af5bc5484dae3a537ea70be771d67 100644
--- a/root_gnn_dgl/jobs/prep_data/run_processing.py
+++ b/root_gnn_dgl/jobs/prep_data/run_processing.py
@@ -79,10 +79,7 @@ def main():
         # "configs/stats_100K/ttH_CP_even_vs_odd.yaml",
         # "configs/stats_all/pretraining_multiclass.yaml",
         # "configs/stats_all/ttH_CP_even_vs_odd.yaml",
-        # "configs/attention/ttH_CP_even_vs_odd.yaml",
-        "configs/stats_all/ttH_CP_even_vs_odd_batch_size_2048.yaml",
-        "configs/stats_all/ttH_CP_even_vs_odd_batch_size_4096.yaml",
-        "configs/stats_all/ttH_CP_even_vs_odd_batch_size_8192.yaml",
+        "configs/attention/ttH_CP_even_vs_odd.yaml",
     ]
 
     # Path to the bash script to be called
diff --git a/root_gnn_dgl/jobs/salloc.sh b/root_gnn_dgl/jobs/salloc.sh
index c92388e06b720ddde3849efdbd6285f273bacd1d..5c7f378bda998338b844af9fe35493993432518d 100644
--- a/root_gnn_dgl/jobs/salloc.sh
+++ b/root_gnn_dgl/jobs/salloc.sh
@@ -1 +1 @@
-salloc --nodes 4 --qos interactive --time 04:00:00 --constraint gpu --account=atlas --gres=gpu:4
+salloc --nodes 4 --qos interactive --time 04:00:00 --constraint gpu --account=trn007 --gres=gpu:4
diff --git a/root_gnn_dgl/jobs/training/conda/run_job.sh b/root_gnn_dgl/jobs/training/conda/run_job.sh
deleted file mode 100755
index 821f16fc265a3ce3e41c9062f391779527554c78..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/jobs/training/conda/run_job.sh
+++ /dev/null
@@ -1,28 +0,0 @@
-#!/bin/bash
-#SBATCH -N 1
-#SBATCH -C gpu
-#SBATCH -q shared
-#SBATCH -t 15:00:00
-#SBATCH -A atlas
-#SBATCH -o /global/cfs/projectdirs/atlas/joshua/gnn/root_gnn_dgl/jobs/slurm/%j.out # STDOUT
-
-CONFIG=$1
-shift
-ARGUEMENTS="$*"
-
-DIRECTORY="/global/cfs/projectdirs/atlas/joshua/gnn/root_gnn_dgl/"
-BASE_COMMAND="$DIRECTORY/training_script.py $ARGUEMENTS --preshuffle --nocompile --lazy --config $DIRECTORY"
-
-echo "launched image"
-cd $DIRECTORY
-
-eval "$(conda shell.bash hook)"
-conda init bash
-conda activate /opt/conda/envs/dgl
-
-COMMAND="$BASE_COMMAND$CONFIG"
-
-echo "Running my script now"
-echo $COMMAND
-python -u $COMMAND
-echo "Done"
\ No newline at end of file
diff --git a/root_gnn_dgl/jobs/training/multinode/run_multinode_1.sh b/root_gnn_dgl/jobs/training/multinode/run_multinode_1.sh
new file mode 100755
index 0000000000000000000000000000000000000000..f15bbf9051b700e77344bffaad4670c856538092
--- /dev/null
+++ b/root_gnn_dgl/jobs/training/multinode/run_multinode_1.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+#SBATCH -C gpu
+#SBATCH -N 4
+#SBATCH --gres=gpu:4
+#SBATCH -q regular
+#SBATCH --mail-user=ho22joshua@berkeley.edu
+#SBATCH --mail-type=ALL
+#SBATCH -t 05:00:00
+#SBATCH -A atlas
+#SBATCH -o /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/jobs/slurm/%j.out # STDOUT
+
+
+CONFIG="$*"
+
+echo "Executing command: $CONFIG"
+
+cd /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/
+
+eval "$(conda shell.bash hook)"
+conda init bash
+conda activate /global/homes/j/joshuaho/.conda/envs/dgl
+
+# Run the Python script and capture the output
+MASTER_PORT=$(python /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/scripts/find_free_port.py)
+
+# export MASTER_ADDR=$(hostname)
+export MASTER_PORT=$MASTER_PORT
+
+# Dynamically get the hostname of the first node in the allocation to use as MASTER_ADDR
+export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
+
+# Debugging: Print the master address and port
+echo "Master Address: $MASTER_ADDR"
+echo "Master Port: $MASTER_PORT"
+
+TORCHRUN_ARGUMENTS="--nnodes=$SLURM_NNODES \
+                    --nproc-per-node=$SLURM_GPUS_ON_NODE \
+                    --rdzv-id=$SLURM_JOB_ID \
+                    --rdzv-backend=c10d \
+                    --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT"
+
+GPUS=$(( $SLURM_NNODES * 4 ))
+
+echo GPUS: $GPUS
+
+srun --gpus=$GPUS \
+    --ntasks-per-node=1 \
+    /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/launch_image.sh \
+    "--entrypoint /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/jobs/run_multinode_2.sh" \
+    $TORCHRUN_ARGUMENTS \
+    $CONFIG
\ No newline at end of file
diff --git a/root_gnn_dgl/jobs/training/multinode/run_multinode_2.sh b/root_gnn_dgl/jobs/training/multinode/run_multinode_2.sh
new file mode 100755
index 0000000000000000000000000000000000000000..2b890388629f5c2b7d2b7a4498cddca827986bfd
--- /dev/null
+++ b/root_gnn_dgl/jobs/training/multinode/run_multinode_2.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+TORCHRUN_ARGUMENTS="$1 $2 $3 $4 $5"
+shift 5
+TRAIN_ARGUMENTS="$@"
+
+# Print the entire argument string
+echo "TORCHRUN_ARGUMENTS: $TORCHRUN_ARGUMENTS"
+echo "TRAIN_ARGUMENTS: $TRAIN_ARGUMENTS"
+
+cd /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/
+
+DIRECTORY="/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/configs/model_configs/"
+COMMAND="/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/scripts/training_script.py --preshuffle --nocompile --lazy --config $DIRECTORY$TRAIN_ARGUMENTS"
+
+eval "$(conda shell.bash hook)"
+conda init bash
+conda activate /opt/conda/envs/dgl
+
+echo $COMMAND
+
+torchrun \
+    $TORCHRUN_ARGUMENTS \
+    $COMMAND
\ No newline at end of file
diff --git a/root_gnn_dgl/jobs/training/multinode/run_multinode_3.sh b/root_gnn_dgl/jobs/training/multinode/run_multinode_3.sh
new file mode 100755
index 0000000000000000000000000000000000000000..1c9c31abb71f95201b8e9c3c9c16d44269968f1e
--- /dev/null
+++ b/root_gnn_dgl/jobs/training/multinode/run_multinode_3.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+CONFIG=$1
+shift
+ARGUEMENTS="$*"
+
+
+DIRECTORY="/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/configs/model_configs/"
+BASE_COMMAND="/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/scripts/training_script.py $ARGUEMENTS --preshuffle --nocompile --lazy --config $DIRECTORY"
+
+echo "launched image"
+cd /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/
+
+COMMAND="$BASE_COMMAND$CONFIG"
+
+eval "$(conda shell.bash hook)"
+conda init bash
+conda activate /opt/conda/envs/dgl
+
+echo "Running my script now"
+python $COMMAND
+
+
+echo "Done"
\ No newline at end of file
diff --git a/root_gnn_dgl/jobs/training/multinode/submit.sh b/root_gnn_dgl/jobs/training/multinode/submit.sh
new file mode 100644
index 0000000000000000000000000000000000000000..e0a24828805c5e7585e5f90b0c7a1b0e979ca6cf
--- /dev/null
+++ b/root_gnn_dgl/jobs/training/multinode/submit.sh
@@ -0,0 +1,21 @@
+date
+
+DIRECTORY="/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/configs/model_configs/"
+
+configs=(
+  'pretraining_multilabel/multilabel_5_particle_counting.yaml --restart --multinode'
+  'pretraining_multilabel/multilabel_17_higgs_kinematics.yaml --restart --multinode'
+  'pretraining_multilabel/multilabel_29_top_kinematics.yaml --restart --multinode'
+  'pretraining_multilabel/multilabel_41_higgs_tops_all_kinematics.yaml --restart --multinode'
+)
+
+counter=0
+
+for job in "${configs[@]}"
+do
+  sbatch --job-name="$job" \
+        /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/jobs/run_multinode_1.sh "$job"
+  ((counter++))
+done
+
+echo "Total jobs submitted: $counter"
\ No newline at end of file
diff --git a/root_gnn_dgl/jobs/training/podman/run_job.sh b/root_gnn_dgl/jobs/training/podman/run_job.sh
deleted file mode 100755
index 98c4529b142e3df76193a15ed9c95d18f9b81e34..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/jobs/training/podman/run_job.sh
+++ /dev/null
@@ -1,14 +0,0 @@
-#!/bin/bash
-#SBATCH -N 1
-#SBATCH -C "gpu&hbm80g"
-#SBATCH -q shared
-#SBATCH -t 24:00:00
-#SBATCH -A atlas
-#SBATCH -o /global/cfs/projectdirs/atlas/joshua/gnn/root_gnn_dgl/jobs/slurm/%j.out # STDOUT
-
-ARGUEMENTS="$*"
-
-echo "Arguements: $ARGUEMENTS"
-
-## create a launch image script
-source "/global/cfs/projectdirs/atlas/joshua/gnn/root_gnn_dgl/setup/launch_image.sh" "/global/cfs/projectdirs/atlas/joshua/gnn/root_gnn_dgl/jobs/training/podman/run_job_image.sh" $ARGUEMENTS
diff --git a/root_gnn_dgl/jobs/training/podman/run_job_image.sh b/root_gnn_dgl/jobs/training/podman/run_job_image.sh
deleted file mode 100755
index 9fd14036e16f6f3d12eaf86239f040f43d083f00..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/jobs/training/podman/run_job_image.sh
+++ /dev/null
@@ -1,31 +0,0 @@
-#!/bin/bash
-
-CONFIG=$1
-shift
-# Store any other potential arguments safely
-OTHER_ARGUEMENTS=("$@")
-
-DIRECTORY="/global/cfs/projectdirs/atlas/joshua/gnn/root_gnn_dgl/"
-
-cd $DIRECTORY
-
-# Use a bash array to build the command and its arguments
-# Each element in the () is a separate argument.
-COMMAND_ARGS=(
-    "$DIRECTORY/scripts/training_script.py"
-    "${OTHER_ARGUEMENTS[@]}"
-    "--preshuffle"
-    "--nocompile"
-    "--lazy"
-    "--config"
-    "$DIRECTORY$CONFIG"
-)
-
-echo "Running my script now"
-# Using "@" in quotes expands the array correctly
-echo "Executing: python -u ${COMMAND_ARGS[@]}"
-
-# The "${COMMAND_ARGS[@]}" syntax ensures each element is passed as a distinct argument
-python -u "${COMMAND_ARGS[@]}"
-
-echo "Done"
\ No newline at end of file
diff --git a/root_gnn_dgl/jobs/training/podman/submit.sh b/root_gnn_dgl/jobs/training/podman/submit.sh
deleted file mode 100644
index 526f548c1ca1c28aafed0c7be97f02a0bf838bca..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/jobs/training/podman/submit.sh
+++ /dev/null
@@ -1,61 +0,0 @@
-date
-
-DIRECTORY="/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/configs/model_configs/"
-
-configs=(  
-  # "configs/multiclass_pretraining/baseline.yaml"
-  # "configs/multiclass_pretraining/pretraining_batch_size/multiclass_bs_4096.yaml"
-  # "configs/multiclass_pretraining/pretraining_hid_size/multiclass_hid_256.yaml"
-  # "configs/multiclass_pretraining/pretraining_lr/multiclass_lr_1e2.yaml"
-  # "configs/multiclass_pretraining/pretraining_n_layers/multiclass_layers_6.yaml"
-
-  # "configs/multiclass_pretraining/pretraining_batch_size/multiclass_bs_2048.yaml"
-  # "configs/multiclass_pretraining/pretraining_hid_size/multiclass_hid_128.yaml"
-  # "configs/multiclass_pretraining/pretraining_lr/multiclass_lr_1e3.yaml"
-  # "configs/multiclass_pretraining/pretraining_n_layers/multiclass_layers_5.yaml"
-
-  # "configs/higgs_production/baseline.yaml"
-  # "configs/higgs_production/higgs_production_batch_size/higgs_production_bs_4096.yaml"
-  # "configs/higgs_production/higgs_production_hid_size/higgs_production_hid_256.yaml"
-  # "configs/higgs_production/higgs_production_lr/higgs_production_lr_1e2.yaml"
-  # "configs/higgs_production/higgs_production_n_layers/higgs_production_layers_6.yaml"
-
-  # "configs/higgs_production/higgs_production_batch_size/higgs_production_bs_2048.yaml"
-  # "configs/higgs_production/higgs_production_hid_size/higgs_production_hid_128.yaml"
-  # "configs/higgs_production/higgs_production_lr/higgs_production_lr_1e3.yaml"
-  # "configs/higgs_production/higgs_production_n_layers/higgs_production_layers_5.yaml"
-  # "configs/higgs_production/baseline2.yaml"
-  # "configs/higgs_production/baseline3.yaml"
-  # "configs/higgs_production/baseline4.yaml"
-  # "configs/higgs_production/baseline5.yaml"
-  "configs/higgs_production/multiclass_finetuning/baseline.yaml"
-  "configs/higgs_production/multiclass_finetuning/baseline_lr_1e4.yaml"
-  "configs/higgs_production/multiclass_finetuning/baseline_lr_1e6.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_hid_128.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_hid_128_lr_1e4.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_hid_128_lr_1e6.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_hid_256.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_hid_256_lr_1e4.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_hid_256_lr_1e6.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_layers_6.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_layers_6_lr_1e4.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_layers_6_lr_1e6.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_lr_1e3.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_lr_1e3_lr_1e4.yaml"
-  "configs/higgs_production/multiclass_finetuning/multiclass_lr_1e3_lr_1e6.yaml"
-
-
-)
-
-counter=0
-
-hours=24
-time="${hours}:00:00"
-
-for job in "${configs[@]}"
-do
-  sbatch --job-name="$job" --time="$time" /global/cfs/projectdirs/atlas/joshua/gnn/root_gnn_dgl/jobs/training/podman/run_job.sh "$job"
-  ((counter++))
-done
-
-echo "Total jobs submitted: $counter"
diff --git a/root_gnn_dgl/jobs/training/singlegpu/run_job.sh b/root_gnn_dgl/jobs/training/singlegpu/run_job.sh
new file mode 100755
index 0000000000000000000000000000000000000000..cf4317b17682f4e6f16b4f406eeb4d19dcc2d541
--- /dev/null
+++ b/root_gnn_dgl/jobs/training/singlegpu/run_job.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+#SBATCH -N 1
+#SBATCH -C gpu
+#SBATCH -q shared
+#SBATCH --mail-user=ho22joshua@berkeley.edu
+#SBATCH --mail-type=ALL
+#SBATCH -t 15:00:00
+#SBATCH -A atlas
+#SBATCH -o /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/jobs/slurm/%j.out # STDOUT
+
+ARGUEMENTS="$*"
+
+echo "Arguments: $ARGUEMENTS"
+echo "launching image"
+source launch_image.sh "--entrypoint /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/jobs/run_job_image.sh" $ARGUEMENTS
\ No newline at end of file
diff --git a/root_gnn_dgl/jobs/training/singlegpu/run_job_image.sh b/root_gnn_dgl/jobs/training/singlegpu/run_job_image.sh
new file mode 100755
index 0000000000000000000000000000000000000000..5f213b5e478703af57bcd897815abc45bde63793
--- /dev/null
+++ b/root_gnn_dgl/jobs/training/singlegpu/run_job_image.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+CONFIG=$1
+shift
+ARGUEMENTS="$*"
+
+DIRECTORY="/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/configs/model_configs/"
+BASE_COMMAND="/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/scripts/training_script.py $ARGUEMENTS --preshuffle --nocompile --lazy --config $DIRECTORY"
+
+echo "launched image"
+cd /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/
+
+COMMAND="$BASE_COMMAND$CONFIG"
+
+eval "$(conda shell.bash hook)"
+conda init bash
+conda activate /opt/conda/envs/dgl
+
+echo "Running my script now"
+echo $COMMAND
+python -u $COMMAND
+echo "Done"
\ No newline at end of file
diff --git a/root_gnn_dgl/jobs/training/conda/submit.sh b/root_gnn_dgl/jobs/training/singlegpu/submit.sh
similarity index 50%
rename from root_gnn_dgl/jobs/training/conda/submit.sh
rename to root_gnn_dgl/jobs/training/singlegpu/submit.sh
index c7d2b55a94fec0ae4a57d69d224fc539b37a3105..474bcd0b91cb0a56e0fa9f14c4ab4c8557334343 100644
--- a/root_gnn_dgl/jobs/training/conda/submit.sh
+++ b/root_gnn_dgl/jobs/training/singlegpu/submit.sh
@@ -3,10 +3,7 @@ date
 DIRECTORY="/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/configs/model_configs/"
 
 configs=(  
-  "configs/stats_all/ttH_CP_even_vs_odd.yaml"
-  "configs/stats_all/ttH_CP_even_vs_odd_batch_size_2048.yaml"
-  "configs/stats_all/ttH_CP_even_vs_odd_batch_size_4096.yaml"
-  "configs/stats_all/ttH_CP_even_vs_odd_batch_size_8192.yaml"
+  "run_3_ttH/v05/sb_yukawa_cp_abs_weights.yaml --abs"
 )
 
 counter=0
@@ -16,7 +13,7 @@ time="${hours}:00:00"
 
 for job in "${configs[@]}"
 do
-  sbatch --job-name="$job" --time="$time" /global/cfs/projectdirs/atlas/joshua/gnn/root_gnn_dgl/jobs/training/conda/run_job.sh "$job"
+  sbatch --job-name="$job" --time="$time" /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/jobs/run_job.sh "$job"
   ((counter++))
 done
 
diff --git a/root_gnn_dgl/models/GCN.py b/root_gnn_dgl/models/GCN.py
index 35b09bf1259a80b03db873ffc824f982ce435ded..b4fd1b47bc0d0115c9a41dd52cece8e85dc40fc9 100755
--- a/root_gnn_dgl/models/GCN.py
+++ b/root_gnn_dgl/models/GCN.py
@@ -1154,7 +1154,6 @@ class Attention(nn.Module):
         self.n_proc_steps = n_proc_steps
         self.layers = nn.ModuleList()
         self.has_global = sample_global.shape[1] != 0
-        self.hid_size = hid_size
         gl_size = sample_global.shape[1] if self.has_global else 1
 
         #encoder
@@ -1197,7 +1196,7 @@ class Attention(nn.Module):
                 batch_num_nodes.append(non_padded_count)
                 start_idx = end_idx
             batch_num_nodes = torch.tensor(batch_num_nodes, device = g.ndata['features'].device)
-            sum_weights = batch_num_nodes[:, None].repeat(1, self.hid_size)
+            sum_weights = batch_num_nodes[:, None].repeat(1, 64)
             global_feats = batch_num_nodes[:, None].to(torch.float)
 
         h_global = self.global_encoder(global_feats)
@@ -1365,7 +1364,6 @@ class Transferred_Learning_Attention(nn.Module):
         self.n_proc_steps = n_proc_steps
         self.layers = nn.ModuleList()
         self.has_global = sample_global.shape[1] != 0
-        self.hid_size = hid_size
         gl_size = sample_global.shape[1] if self.has_global else 1
 
         self.learning_rate = learning_rate
@@ -1442,7 +1440,7 @@ class Transferred_Learning_Attention(nn.Module):
                 batch_num_nodes.append(non_padded_count)
                 start_idx = end_idx
             batch_num_nodes = torch.tensor(batch_num_nodes, device = g.ndata['features'].device)
-            sum_weights = batch_num_nodes[:, None].repeat(1, self.hid_size)
+            sum_weights = batch_num_nodes[:, None].repeat(1, 64)
             global_feats = batch_num_nodes[:, None].to(torch.float)
 
         h_global = self.TL_global_encoder(global_feats)
@@ -1858,7 +1856,6 @@ class Clustering(nn.Module):
         self.n_layers = n_layers
         self.n_proc_steps = n_proc_steps
         self.layers = nn.ModuleList()
-        self.hid_size = hid_size
         if (len(sample_global) == 0):
             self.has_global = False
         else:
@@ -1902,7 +1899,7 @@ class Clustering(nn.Module):
                 batch_num_nodes.append(non_padded_count)
                 start_idx = end_idx
             batch_num_nodes = torch.tensor(batch_num_nodes, device = g.ndata[features].device)
-            sum_weights = batch_num_nodes[:, None].repeat(1, self.hid_size)
+            sum_weights = batch_num_nodes[:, None].repeat(1, 64)
             global_feats = batch_num_nodes[:, None].to(torch.float)
 
         h_global = self.global_encoder(global_feats)
diff --git a/root_gnn_dgl/profile.sh b/root_gnn_dgl/profile.sh
deleted file mode 100644
index 9d3b20f528c749c34c3de3fa4389e95a16300967..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/profile.sh
+++ /dev/null
@@ -1,35 +0,0 @@
-nsys profile \
-  -o /pscratch/sd/j/joshuaho/full_stats_profile_1_gpu_batch_size_1028 \
-  --capture-range=cudaProfilerApi \
-  --duration=100 \
-  --force-overwrite true \
-  --trace=nvtx \
-  --cudabacktrace=all \
-  python scripts/training_script.py --config configs/stats_all/ttH_CP_even_vs_odd.yaml --preshuffle --nocompile --lazy --restart --profile
-
-nsys profile \
-  -o /pscratch/sd/j/joshuaho/full_stats_profile_1_gpu_batch_size_2048 \
-  --capture-range=cudaProfilerApi \
-  --duration=100 \
-  --force-overwrite true \
-  --trace=nvtx \
-  --cudabacktrace=all \
-  python scripts/training_script.py --config configs/stats_all/ttH_CP_even_vs_odd_batch_size_2048.yaml --preshuffle --nocompile --lazy --restart --profile
-
-nsys profile \
-  -o /pscratch/sd/j/joshuaho/full_stats_profile_1_gpu_batch_size_4096 \
-  --capture-range=cudaProfilerApi \
-  --duration=100 \
-  --force-overwrite=true \
-  --trace=nvtx \
-  --cudabacktrace=all \
-  python scripts/training_script.py --config configs/stats_all/ttH_CP_even_vs_odd_batch_size_4096.yaml --preshuffle --nocompile --lazy --restart --profile
-
-nsys profile \
-  -o /pscratch/sd/j/joshuaho/full_stats_profile_1_gpu_batch_size_8192 \
-  --capture-range=cudaProfilerApi \
-  --duration=100 \
-  --force-overwrite true \
-  --trace=nvtx \
-  --cudabacktrace=all \
-  python scripts/training_script.py --config configs/stats_all/ttH_CP_even_vs_odd_batch_size_8192.yaml --preshuffle --nocompile --lazy --restart --profile
diff --git a/root_gnn_dgl/root_gnn_base/batched_dataset.py b/root_gnn_dgl/root_gnn_base/batched_dataset.py
index 4abecc5ecaefd811a3ff93e4c4a368710ade4ca7..101ada4a0a7ae4c2fba01068579d190332fbe444 100644
--- a/root_gnn_dgl/root_gnn_base/batched_dataset.py
+++ b/root_gnn_dgl/root_gnn_base/batched_dataset.py
@@ -16,7 +16,7 @@ def GetBatchedLoader(dataset, batch_size, mask_fn = None, drop_last=True, **kwar
 
 #Dataset which contains prebatched shuffled graphs. Cannot be saved to disk, else batching info is lost.
 class PreBatchedDataset(DGLDataset):
-    def __init__(self, start_dataset, batch_size, mask_fn = None, drop_last=True, save_to_disk = True, suffix = '', chunks = 1, chunkno = -1, shuffle = True, padding_mode = 'NONE', hidden_size=64, **kwargs):
+    def __init__(self, start_dataset, batch_size, mask_fn = None, drop_last=True, save_to_disk = True, suffix = '', chunks = 1, chunkno = -1, shuffle = True, padding_mode = 'NONE', **kwargs):
         print(f'Unused kwargs: {kwargs}')
         self.start_dataset = start_dataset
         self.start_dataset.load()
@@ -34,7 +34,6 @@ class PreBatchedDataset(DGLDataset):
         self.suffix = suffix
         self.current_chunk = None
         self.current_chunk_idx = -1
-        self.hid_size = hidden_size
         super().__init__(name = start_dataset.name + '_prebatched_padded', save_dir=start_dataset.save_dir)
 
     def process(self):
@@ -87,7 +86,7 @@ class PreBatchedDataset(DGLDataset):
             for i in range(len(self.graphs)):
                 unbatched_g = dgl.unbatch(self.graphs[i])
                 max_num_nodes = max(g.number_of_nodes() for g in unbatched_g)
-                self.graphs[i] = utils.pad_batch_num_nodes(self.graphs[i], max_num_nodes, hid_size=self.hid_size)
+                self.graphs[i] = utils.pad_batch_num_nodes(self.graphs[i], max_num_nodes)
                 self.batch_num_nodes.append(self.graphs[i].batch_num_nodes())
                 self.batch_num_edges.append(self.graphs[i].batch_num_edges())
         else:
diff --git a/root_gnn_dgl/root_gnn_base/dataset.py b/root_gnn_dgl/root_gnn_base/dataset.py
index 6314996076331b6ea374423656dfd3c50fccc612..dd34b01e3e1267e52ad909ddfefa076df5faf112 100644
--- a/root_gnn_dgl/root_gnn_base/dataset.py
+++ b/root_gnn_dgl/root_gnn_base/dataset.py
@@ -1,7 +1,6 @@
 from dgl.data import DGLDataset
 import dgl
-import uproot
-import awkward as ak
+import ROOT
 import torch
 import os
 import glob
@@ -9,15 +8,13 @@ import time
 import numpy as np
 from root_gnn_base import utils
 
-FEATURE_DTYPE = torch.float32
-
 def node_features_from_tree(ch, node_branch_names, node_branch_types, node_feature_scales):
     lengths = []
     for branch, node_type in zip(node_branch_names[0], node_branch_types):
         if node_type == 'single':
             lengths.append(1)
         elif node_type == 'vector':
-            lengths.append(len(ch[branch]))
+            lengths.append(len(getattr(ch, branch)))
         else:
             print('Unknown node branch type: {}'.format(node_type))
     features = []
@@ -29,7 +26,7 @@ def node_features_from_tree(ch, node_branch_names, node_branch_types, node_featu
             feat = []
             for i, length in enumerate(lengths):
                 feat.extend([i,]*length)
-            features.append(torch.tensor(feat, dtype=FEATURE_DTYPE))
+            features.append(torch.tensor(feat))
             continue
         feat = []
         itype = 0
@@ -41,14 +38,16 @@ def node_features_from_tree(ch, node_branch_names, node_branch_types, node_featu
                 this_type_ends_at = sum(lengths[:itype+1])
                 feat.extend(features[0][this_type_starts_at:this_type_ends_at]*torch.cosh(features[1][this_type_starts_at:this_type_ends_at]))
             elif node_type == 'single':
-                feat.append(ch[branch])
+                feat.append(getattr(ch, branch))
             elif node_type == 'vector':
-                feat.extend(ch[branch])
+                feat.extend(getattr(ch, branch))
             itype += 1
-        features.append(torch.as_tensor(np.asarray(feat, dtype=np.float32), dtype=FEATURE_DTYPE))
+        features.append(torch.tensor(feat))
     return torch.stack(features, dim=1) * node_feature_scales, lengths
 
 def full_connected_graph(n_nodes, self_loops=True):
+    senders = []
+    receivers = []
     senders = np.arange(n_nodes*n_nodes) // n_nodes
     receivers = np.arange(n_nodes*n_nodes) % n_nodes
     if not self_loops and n_nodes > 1:
@@ -60,18 +59,19 @@ def full_connected_graph(n_nodes, self_loops=True):
 def check_selection(ch, selection):
     var, cut, op = selection
     if op == '>':
-        return ch[var] > cut
+        return getattr(ch, var) > cut
     elif op == '<':
-        return ch[var] < cut
+        return getattr(ch, var) < cut
     elif op == '==':
-        return ch[var] == cut
-
+        return getattr(ch, var) == cut
+    
 def check_selections(ch, selections):
     for selection in selections:
         if not check_selection(ch, selection):
             return False
     return True
 
+#Base dataset class for making graphs from ROOT ntuples.
 class RootDataset(DGLDataset):
     def __init__(self, name=None, raw_dir=None, save_dir=None, label=1, file_names = '*.root', node_branch_names=None, node_branch_types=None, node_feature_scales=None, 
                  selections=[], save=True, tree_name = 'nominal_Loose', fold_var = 'eventNumber', weight_var = None, chunks = 1, process_chunks = None, global_features = [], tracking_info = [], **kwargs):
@@ -83,12 +83,12 @@ class RootDataset(DGLDataset):
         self.file_names = file_names
         self.node_branch_names = node_branch_names
         self.node_branch_types = node_branch_types
-        self.node_feature_scales = torch.tensor([float(sf) for sf in node_feature_scales], dtype=FEATURE_DTYPE)
+        self.node_feature_scales = torch.tensor([float(sf) for sf in node_feature_scales])
         self.tree_name = tree_name
         self.fold_var = fold_var
         self.tracking_info = tracking_info
         self.tracking_info.insert(0, fold_var)
-        if weight_var is None:
+        if weight_var is None:
             weight_var = 1
         self.tracking_info.insert(1, weight_var)
         self.global_features = global_features
@@ -116,7 +116,7 @@ class RootDataset(DGLDataset):
                 branches.append(feat)
         for selection in self.selections:
             branches.append(selection[0])
-        return list(set(branches))  # Remove duplicates
+        return branches
 
     def make_graph(self, ch):
         t1 = time.time()
@@ -129,7 +129,7 @@ class RootDataset(DGLDataset):
         self.times[0] += t2 - t1
         self.times[1] += t3 - t2
         return g
-
+    
     def process(self):
         times = [0, 0, 0]
         oldtime = time.time()
@@ -139,21 +139,21 @@ class RootDataset(DGLDataset):
             self.files = []
             for file_name in self.file_names:
                 self.files.extend(glob.glob(os.path.join(self.raw_dir, file_name)))
-        branches = self.get_list_of_branches()
+        self.chain = ROOT.TChain(self.tree_name)
 
-        # Read all files and concatenate arrays
-        arrays = []
-        for file in self.files:
-            with uproot.open(file) as f:
-                arrays.append(f[self.tree_name].arrays(branches, library="ak"))
-        if len(arrays) == 0:
+        if len(self.files) == 0:
             print('No files found in {}'.format(os.path.join(self.raw_dir, self.file_names)))
-            return
-        data = ak.concatenate(arrays, axis=0)
-        n_entries = len(data[branches[0]])
+        for file in self.files:
+            utils.set_timeout(60*2)
+            self.chain.Add(file)
+            utils.unset_timeout()
+        branches = self.get_list_of_branches()
+        self.chain.SetBranchStatus('*', 0)
+        for branch in branches:
+            self.chain.SetBranchStatus(branch, 1)
         newtime = time.time()
         times[0] += newtime - oldtime
-        chunks = np.array_split(np.arange(n_entries), self.chunks)
+        chunks = np.array_split(np.arange(self.chain.GetEntries()), self.chunks)
         chunks = [chunk for i, chunk in enumerate(chunks) if i in self.process_chunks]
 
         self.graph_chunks = []
@@ -162,7 +162,6 @@ class RootDataset(DGLDataset):
         self.global_chunks = []
         chunk_id = -1
         for chunk in chunks:
-            print('Processing chunk {}/{}'.format(chunk_id + 1, len(chunks)), flush=True)
             chunk_id += 1
             graphs = []
             labels = []
@@ -170,30 +169,28 @@ class RootDataset(DGLDataset):
             globals = []
             for ientry in chunk:
                 if (ientry % 10000 == 0):
-                    print('Processing event {}/{}'.format(ientry, n_entries), flush=True)
-                ch = {b: data[b][ientry] for b in branches}
+                    print('Processing event {}/{}'.format(ientry, self.chain.GetEntries()), flush=True)
+                self.chain.GetEntry(ientry)
                 passed = True
                 for selection in self.selections:
-                    if not check_selection(ch, selection):
+                    if not check_selection(self.chain, selection):
                         passed = False
                         continue
                 oldtime = newtime
                 newtime = time.time()
                 times[1] += newtime - oldtime
                 if passed:
-                    graphs.append(self.make_graph(ch))
-                    labels.append(self.label)
-                    tracking.append(torch.zeros(len(self.tracking_info), dtype=FEATURE_DTYPE))
-                    globals.append(torch.zeros(len(self.global_features), dtype=FEATURE_DTYPE))
+                    graphs.append(self.make_graph(self.chain))
+                    labels.append(self.label)
+                    tracking.append(torch.zeros(len(self.tracking_info), dtype=torch.double))
+                    globals.append(torch.zeros(len(self.global_features)))
                     for i_ti, tr_branch in enumerate(self.tracking_info):
                         if isinstance(tr_branch, str):
-                            dtype = tracking[-1].dtype
-                            tracking[-1][i_ti] = torch.as_tensor(ch[tr_branch], dtype=dtype)
-                            # tracking[-1][i_ti] = ch[tr_branch]
+                            tracking[-1][i_ti] = getattr(self.chain, tr_branch)
                         else:
                             tracking[-1][i_ti] = tr_branch
                     for i_gl, gl_branch in enumerate(self.global_features):
-                        globals[-1][i_gl] = ch[gl_branch]
+                        globals[-1][i_gl] = getattr(self.chain, gl_branch)
                 oldtime = newtime
                 newtime = time.time()
                 times[2] += newtime - oldtime
@@ -201,12 +198,6 @@ class RootDataset(DGLDataset):
             labels = torch.tensor(labels)
             tracking = torch.stack(tracking)
             globals = torch.stack(globals)
-
-            # self.graph_chunks.append(graphs)
-            # self.label_chunks.append(labels)
-            # self.tracking_chunks.append(tracking)
-            # self.global_chunks.append(globals)
-            # self.counts.append(len(graphs))
             
             if (self.chunks > 1):
                 self.save_chunk(chunk_id, graphs, labels, tracking, globals)
@@ -217,18 +208,31 @@ class RootDataset(DGLDataset):
                 self.graphs = graphs
                 self.save()
         return
-
+        self.graphs = self.graph_chunks[0]
+        for chunk in self.graph_chunks[1:]:
+            self.graphs += chunk
+        self.labels = torch.cat(self.label_chunks)
+        self.tracking = torch.cat(self.tracking_chunks)
+        self.global_features = torch.cat(self.global_chunks)
+        print('Time spent: Creating TChain: {}s, Getting Entries and Selection: {}s, Graph Creation: {}s'.format(*times))
+        print('Time spent in node_features_from_tree: {}s, full_connected_graph: {}s'.format(*self.times))
+    
     def save(self):
+        """save the graph list and the labels"""
         if not self.save_to_disk:
             return
         graph_path = os.path.join(self.save_dir, self.name + '.bin')
         if self.chunks == 1:
+            # print(len(self.graphs))
+            # print(len(self.labels))
+            # print(len(self.tracking))
+            # print(len(self.globals))
             print(f'Saving dataset to {os.path.join(self.save_dir, self.name + ".bin")}')
             dgl.save_graphs(str(graph_path), self.graphs, {'labels': torch.tensor(self.labels), 'tracking': torch.tensor(self.tracking), 'global': torch.tensor(self.global_features)})
         else:
+            print(len(self.graph_chunks))
             for i in range(len(self.process_chunks)):
                 print(f'Saving dataset to {os.path.join(self.save_dir, self.name + f"_{self.process_chunks[i]}.bin")}')
-
                 dgl.save_graphs(str(graph_path).replace('.bin', f'_{self.process_chunks[i]}.bin'), self.graph_chunks[i], {'labels': self.label_chunks[i], 'tracking': self.tracking_chunks[i], 'global': self.global_chunks[i]})
 
     def save_chunk(self, chunk_id, graphs, labels, tracking, globals):
@@ -237,7 +241,7 @@ class RootDataset(DGLDataset):
         graph_path = os.path.join(self.save_dir, self.name + '.bin')
         print(f'Saving dataset to {os.path.join(self.save_dir, self.name + f"_{self.process_chunks[chunk_id]}.bin")}')
         dgl.save_graphs(str(graph_path).replace('.bin', f'_{self.process_chunks[chunk_id]}.bin'), graphs, {'labels': labels, 'tracking': tracking, 'global': globals})
-
+            
     def has_cache(self):
         print(f'Checking for cache of {self.name}')
         if not self.save_to_disk:
@@ -286,7 +290,7 @@ class RootDataset(DGLDataset):
     
     def __len__(self):
         return len(self.graphs)
-
+    
 #Dataset with edge features added (deta, dphi, dR)
 class EdgeDataset(RootDataset):
     def make_graph(self, ch):
@@ -471,8 +475,8 @@ class MultiLabelDataset(EdgeDataset):
                 if passed:
                     graphs.append(self.make_graph(self.chain))
                     labels.append(self.get_label(self.chain))
-                    tracking.append(torch.zeros(len(self.tracking_info), dtype=FEATURE_DTYPE))
-                    globals.append(torch.zeros(len(self.global_features), dtype=FEATURE_DTYPE))
+                    tracking.append(torch.zeros(len(self.tracking_info), dtype=torch.double))
+                    globals.append(torch.zeros(len(self.global_features)))
                     for i_ti, tr_branch in enumerate(self.tracking_info):
                         if isinstance(tr_branch, str):
                             tracking[-1][i_ti] = getattr(self.chain, tr_branch)
@@ -679,4 +683,4 @@ class AugmentedDataset(RootDataset):
         dR   = torch.sqrt(deta**2 + dphi**2)
         g.edata['augmented_features'] = torch.stack([deta, dphi, dR], dim=1)
 
-        return g
+        return g
\ No newline at end of file
diff --git a/root_gnn_dgl/root_gnn_base/utils.py b/root_gnn_dgl/root_gnn_base/utils.py
index 1f9c9e7668d54f266dc1fd77fcf501d749843bf2..8f4ff676c074a05c90499318c5fe65f3b2311eb8 100644
--- a/root_gnn_dgl/root_gnn_base/utils.py
+++ b/root_gnn_dgl/root_gnn_base/utils.py
@@ -8,16 +8,10 @@ import dgl
 import signal
 
 def buildFromConfig(conf, run_time_args = {}):
-    device = run_time_args.get('device', 'cpu')
     if 'module' in conf:
         module = importlib.import_module(conf['module'])
         cls = getattr(module, conf['class'])
-        args = conf['args'].copy()
-        if 'weight' in args and isinstance(args['weight'], list):
-            args['weight'] = torch.tensor(args['weight'], dtype=torch.float, device=device)
-        # Remove device from run_time_args to not pass it to the class
-        run_time_args = {k: v for k, v in run_time_args.items() if k != 'device'}
-        return cls(**args, **run_time_args)
+        return cls(**conf['args'], **run_time_args)
     else:
         print('No module specified in config. Returning None.')
 
@@ -92,7 +86,7 @@ def pad_batch(batch, edges = 104000, nodes = 16000):
     return make_padding_graph(batch, pad_nodes, pad_edges)
 
 def pad_batch_num_nodes(batch, max_num_nodes, hid_size = 64):
-    print(f"Padding each graph to have {max_num_nodes} nodes. Using hidden size {hid_size}.")
+    print(f"Padding each graph to have {max_num_nodes} nodes")
 
     unbatched = dgl.unbatch(batch)
     for g in unbatched:
@@ -183,101 +177,21 @@ def get_specific_epoch(config, target_epoch, device = None, from_ryan = False):
             checkpoint = torch.load(os.path.join(config['Training_Directory'], f'model_epoch_{last_epoch}.pt'), map_location=device)
     return last_epoch, checkpoint
 
-#Return the index and checkpoint of the nest epoch.
-def get_best_epoch(config, var='Test_AUC', mode='max', device=None, from_ryan=False):
-    # Read the training log
-    log = read_log(config)
-    
-    # Ensure the specified variable exists in the log
-    if var not in log:
-        raise ValueError(f"Variable '{var}' not found in the training log.")
-    
-    # Determine the target epoch based on the mode ('max' or 'min')
-    if mode == 'max':
-        target_epoch = int(np.argmax(log[var]))
-        print(f"Best epoch based on '{var}' (max): {target_epoch} with value: {log[var][target_epoch]}")
-    elif mode == 'min':
-        target_epoch = int(np.argmin(log[var]))
-        print(f"Best epoch based on '{var}' (min): {target_epoch} with value: {log[var][target_epoch]}")
-    else:
-        raise ValueError(f"Invalid mode '{mode}'. Expected 'max' or 'min'.")
-    
-    # Initialize checkpoint retrieval variables
-    last_epoch = -1
-    checkpoint = None
-
-    # Iterate through epochs up to the target epoch to find the corresponding checkpoint
-    for ep in range(target_epoch + 1):
-        if from_ryan:
-            checkpoint_path = os.path.join(
-                '/global/cfs/cdirs/atlas/berobert/root_gnn_dgl/',
-                config['Training_Directory'],
-                f'model_epoch_{ep}.pt'
-            )
-        else:
-            checkpoint_path = os.path.join(
-                config['Training_Directory'],
-                f'model_epoch_{ep}.pt'
-            )
-        
-        if os.path.exists(checkpoint_path):
-            last_epoch = ep
-        else:
-            print(f'Epoch {ep} not found. Stopping at epoch {last_epoch}')
-            print('File not found: ', checkpoint_path)
-            break
-
-    # Load the checkpoint for the last valid epoch
-    if last_epoch >= 0:
-        if from_ryan:
-            checkpoint_path = os.path.join(
-                '/global/cfs/cdirs/atlas/berobert/root_gnn_dgl/',
-                config['Training_Directory'],
-                f'model_epoch_{last_epoch}.pt'
-            )
-        else:
-            checkpoint_path = os.path.join(
-                config['Training_Directory'],
-                f'model_epoch_{last_epoch}.pt'
-            )
-        
-        checkpoint = torch.load(checkpoint_path, map_location=device)
-    
-    return last_epoch, checkpoint
-
+#Convert the training log into a dict of per-column NumPy arrays for plotting.
 def read_log(config):
     lines = []
     with open(config['Training_Directory'] + '/training.log', 'r') as f:
         lines = f.readlines()
-    lines = [l for l in lines if 'Epoch' in l]
-    
+    lines = [ l for l in lines if 'Epoch' in l ]
+    nlines = len(lines)
     labels = []
     for field in lines[0].split('|'):
         labels.append(field.split()[0])
-    
-    # Initialize log as a dictionary with empty lists
-    log = {label: [] for label in labels}
-    
-    for line in lines:
-        valid_row = True  # Flag to check if the row is valid
-        temp_row = {}  # Temporary row to store values before adding to log
-        
+    log = {label : np.zeros(nlines) for label in labels}
+    for i, line in enumerate(lines):
         for field in line.split('|'):
             spl = field.split()
-            try:
-                temp_row[spl[0]] = float(spl[1])
-            except (ValueError, IndexError):
-                valid_row = False  # Mark row as invalid if conversion fails
-                break
-        
-        if valid_row:  # Only add the row if all fields are valid
-            for label in labels:
-                log[label].append(temp_row.get(label, np.nan))  # Handle missing labels gracefully
-    
-    # Convert lists to numpy arrays for consistency
-    for label in labels:
-        log[label] = np.array(log[label])
-    
+            log[spl[0]][i] = float(spl[1])
     return log
 
 #Plot training logs.
diff --git a/root_gnn_dgl/run_demo.sh b/root_gnn_dgl/run_demo.sh
index 71fc5f9514e5f8f53464d16a38f8e4a95ae832ef..c17c36add96523aaef1682922de611e9504542ec 100644
--- a/root_gnn_dgl/run_demo.sh
+++ b/root_gnn_dgl/run_demo.sh
@@ -13,7 +13,7 @@ done
 
 python scripts/training_script.py --config configs/stats_100K/pretraining_multiclass.yaml --preshuffle --nocompile --lazy
 
-# From Scratch Training
+# From-scratch training and finetuning
 
 datasets=("ttH_CP_even" "ttH_CP_odd")
 chunks=3
@@ -27,11 +27,9 @@ done
 
 python scripts/training_script.py --config configs/stats_100K/ttH_CP_even_vs_odd.yaml --preshuffle --nocompile --lazy
 
-# Finetuning Training
-
 python scripts/training_script.py --config configs/stats_100K/finetuning_ttH_CP_even_vs_odd.yaml --preshuffle --nocompile --lazy
 
-# Inference: Writing GNN Scores for from-scratch training and finetuned training to root files
+# Inference
 files=(
     "ttH_NLO.root"
     "ttH_CPodd.root"
@@ -50,8 +48,8 @@ branch_name=(
 for ((j=0; j<${#files[@]}; j++))
 do
     python scripts/inference.py \
-        --target "/global/cfs/projectdirs/atlas/joshua/gnn_data/stats_100K/${files[j]}" \
-        --destination "/global/cfs/projectdirs/atlas/joshua/gnn_data/scores/stats_100K/${files[j]}" \
+        --target "/global/cfs/projectdirs/trn007/lbl_atlas/data/stats_100K/${files[j]}" \
+        --destination "/global/cfs/projectdirs/trn007/lbl_atlas/data/scores/stats_100K/${files[j]}" \
         --config "${config[@]}" \
         --branch_name "${branch_name[@]}" \
         --chunks 1 \
diff --git a/root_gnn_dgl/scripts/check_dataset_files.py b/root_gnn_dgl/scripts/check_dataset_files.py
deleted file mode 100644
index da5094b691ad1a2dc71b852a8bdbbd1a64938802..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/scripts/check_dataset_files.py
+++ /dev/null
@@ -1,125 +0,0 @@
-import yaml
-import os
-import subprocess
-import argparse
-
-def check_dataset_files(yaml_file, rerun=False):
-    """
-    Check if all required .bin files exist for each dataset in the YAML file.
-    """
-    try:
-        # Open and parse the YAML file
-        with open(yaml_file, 'r') as file:
-            config = yaml.safe_load(file)
-
-        # Check if 'Datasets' exists in the YAML file
-        if 'Datasets' not in config:
-            print(f"No 'Datasets' section found in {yaml_file}.")
-            return
-
-        datasets = config['Datasets']
-        all_files_exist = True
-
-        for dataset_name, dataset_config in datasets.items():
-            # Extract required information
-            save_dir = dataset_config['args']['save_dir']
-            chunks = dataset_config['args']['chunks']
-            folding = dataset_config.get('folding', {})
-            n_folds = folding.get('n_folds', 0)
-            test_folds = folding.get('test', [])
-            train_folds = folding.get('train', [])
-
-            print(f"\n== Checking dataset: {dataset_name} ==")
-            print(f"  save_dir: {save_dir}")
-            print(f"  chunks: {chunks}")
-            print(f"  n_folds: {n_folds}")
-            print(f"  test_folds: {test_folds}")
-            print(f"  train_folds: {train_folds}")
-
-            missing_files = []
-
-            # 1. Check for chunk files
-            for chunk in range(chunks):
-                chunk_file = os.path.join(save_dir, f"{dataset_name}_{chunk}.bin")
-                if not os.path.exists(chunk_file):
-                    missing_files.append(chunk_file)
-
-            # 2. Check for prebatched fold files (test and train)
-            #    Naming: dataset_name_prebatched_padded_{fold}_n_{n_folds}_f_{foldlist}.bin
-            fold_types = [('test', test_folds), ('train', train_folds)]
-            for fold_type, folds in fold_types:
-                if not folds:
-                    continue
-                foldlist_str = '_'.join(map(str, folds))
-                for i in range(chunks):
-                    prebatched_file = os.path.join(
-                        save_dir,
-                        f"{dataset_name}_prebatched_padded_{i}_n_{n_folds}_f_{foldlist_str}.bin"
-                    )
-                    if not os.path.exists(prebatched_file):
-                        missing_files.append(prebatched_file)
-
-            # Print results for the current dataset
-            if missing_files:
-                all_files_exist = False
-                print(f"  Missing files for dataset '{dataset_name}':")
-                for missing_file in missing_files:
-                    print(f"    - {missing_file}")
-
-                # Optionally rerun data prep
-                if rerun:
-                    print(f"  Reprocessing dataset '{dataset_name}' ...")
-                    prep_command = f"jobs/prep_data/prep_data.sh {yaml_file} {dataset_name} {chunks}"
-                    try:
-                        subprocess.run(prep_command, shell=True, check=True)
-                    except subprocess.CalledProcessError as e:
-                        print(f"  Could NOT reprocess '{dataset_name}': {e}")
-            else:
-                print(f"  All files exist for dataset '{dataset_name}'.")
-
-        # Final summary
-        if all_files_exist:
-            print("\nAll required files exist for all datasets.")
-        else:
-            print("\nSome files are missing.")
-
-    except Exception as e:
-        print(f"Error processing {yaml_file}: {e}")
-
-def main(pargs):
-    # Base directory containing the YAML files
-    base_directory = os.getcwd() + "/configs/"
-
-    if pargs.configs:
-        configs = [p.strip() for p in pargs.configs.split(',')]
-    else:
-        configs = [
-            "higgs_production/baseline.yaml",
-            "higgs_production/higgs_production_batch_size/higgs_production_bs_2048.yaml",
-            "higgs_production/higgs_production_batch_size/higgs_production_bs_4096.yaml",
-            "higgs_production/higgs_production_batch_size/higgs_production_bs_8192.yaml",
-        ]
-
-    for config in configs:
-        yaml_file = os.path.join(base_directory, config)
-        if os.path.exists(yaml_file):
-            print(f"\nProcessing file: {config}")
-            check_dataset_files(yaml_file, pargs.rerun)
-        else:
-            print(f"File not found: {yaml_file}")
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(description="Check YAML config files")
-    parser.add_argument(
-        "--configs", "-c",
-        type=str,
-        required=False,
-        help="Comma-separated list of YAML config paths relative to base directory"
-    )
-    parser.add_argument(
-        "--rerun", "-r",
-        action='store_true',   # Correct way for a boolean flag
-        help="Automatically re-run data processing to fix missing files"
-    )
-    args = parser.parse_args()
-    main(args)
\ No newline at end of file
diff --git a/root_gnn_dgl/scripts/inference.py b/root_gnn_dgl/scripts/inference.py
index 0648840b2209fed36c02ecd752ca5af5921915b8..3100a08a099dcfbf3c360f51249984dc2db69e11 100644
--- a/root_gnn_dgl/scripts/inference.py
+++ b/root_gnn_dgl/scripts/inference.py
@@ -1,5 +1,6 @@
 import sys
-file_path = "/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl"
+import os
+file_path = os.getcwd()
 sys.path.append(file_path)
 import os
 import argparse
@@ -186,7 +187,6 @@ def main():
     lend = time.time()
     print('Loader finished in {:.2f} seconds'.format(lend - lstart))
     sample_graph, _, _, global_sample = loader[0]
-    global_sample = []
 
     print('dset length =', len(dset))
     print('loader length =', len(loader))
@@ -198,7 +198,6 @@ def main():
         for config_file, branch in zip(args.config, args.branch_name):
             config = load_config(config_file)
             model = utils.buildFromConfig(config['Model'], {'sample_graph' : sample_graph, 'sample_global': global_sample}).to(device)
-            
             if args.ckpt < 0:
                 ep, checkpoint = utils.get_best_epoch(config, var=args.var, mode='max', device=device)
             else:
diff --git a/root_gnn_dgl/scripts/prep_data.py b/root_gnn_dgl/scripts/prep_data.py
index 4c34de75f2bd48a633a36b4a6e4754321e958e58..87306c834cc1e84e115a85b52e609c2deb956ec7 100644
--- a/root_gnn_dgl/scripts/prep_data.py
+++ b/root_gnn_dgl/scripts/prep_data.py
@@ -33,12 +33,12 @@ def main():
         fold_conf = dset_config["folding"]
         print(f"shuffle_chunks = {shuffle_chunks}, args.chunk = {args.chunk}, padding_mode = {padding_mode}")
         if dset_config["class"] == "LazyMultiLabelDataset":
-            LazyPreBatchedDataset(start_dataset = dset, batch_size = batch_size, mask_fn = utils.fold_selection(fold_conf, "train"), suffix = utils.fold_selection_name(fold_conf, "train"), chunks = shuffle_chunks, chunkno = args.chunk, padding_mode = padding_mode, drop_last=args.drop_last, hidden_size=config['Model']['args']['hid_size'] )
-            LazyPreBatchedDataset(start_dataset = dset, batch_size = batch_size, mask_fn = utils.fold_selection(fold_conf, "test"),  suffix = utils.fold_selection_name(fold_conf, 'test'), chunks = shuffle_chunks, chunkno = args.chunk, padding_mode = padding_mode, drop_last=args.drop_last, hidden_size=config['Model']['args']['hid_size'])
+            LazyPreBatchedDataset(start_dataset = dset, batch_size = batch_size, mask_fn = utils.fold_selection(fold_conf, "train"), suffix = utils.fold_selection_name(fold_conf, "train"), chunks = shuffle_chunks, chunkno = args.chunk, padding_mode = padding_mode, drop_last=args.drop_last)
+            LazyPreBatchedDataset(start_dataset = dset, batch_size = batch_size, mask_fn = utils.fold_selection(fold_conf, "test"),  suffix = utils.fold_selection_name(fold_conf, 'test'), chunks = shuffle_chunks, chunkno = args.chunk, padding_mode = padding_mode, drop_last=args.drop_last)
 
         else:
-            PreBatchedDataset(dset, batch_size, utils.fold_selection(fold_conf, "train"), suffix = utils.fold_selection_name(fold_conf, "train"), chunks = shuffle_chunks, chunkno = args.chunk, padding_mode = padding_mode, drop_last=args.drop_last,hidden_size=config['Model']['args']['hid_size'])
-            PreBatchedDataset(dset, batch_size, utils.fold_selection(fold_conf, "test"),  suffix = utils.fold_selection_name(fold_conf, 'test'), chunks = shuffle_chunks, chunkno = args.chunk, padding_mode = padding_mode, drop_last=args.drop_last,hidden_size=config['Model']['args']['hid_size'] )
+            PreBatchedDataset(dset, batch_size, utils.fold_selection(fold_conf, "train"), suffix = utils.fold_selection_name(fold_conf, "train"), chunks = shuffle_chunks, chunkno = args.chunk, padding_mode = padding_mode, drop_last=args.drop_last)
+            PreBatchedDataset(dset, batch_size, utils.fold_selection(fold_conf, "test"),  suffix = utils.fold_selection_name(fold_conf, 'test'), chunks = shuffle_chunks, chunkno = args.chunk, padding_mode = padding_mode, drop_last=args.drop_last)
 
 if __name__ == "__main__":
-    main()
+    main()
\ No newline at end of file
diff --git a/root_gnn_dgl/scripts/training_script.py b/root_gnn_dgl/scripts/training_script.py
index a8bf8b3a12d6dbb075dfaebcc1a61b55d69678ff..ae733e860ba0459c470890b6be006bfe7829fe46 100644
--- a/root_gnn_dgl/scripts/training_script.py
+++ b/root_gnn_dgl/scripts/training_script.py
@@ -45,10 +45,10 @@ def gpu_mem():
     #     except:
     #         pass
     print(f'Current GPU memory usage: {torch.cuda.memory_allocated() / 1024 / 1024 / 1024} GB')
-    # print(f'Current GPU cache usage: {torch.cuda.memory_cached() / 1024 / 1024 / 1024} GB')
-    # print(f'Current GPU max memory usage: {torch.cuda.max_memory_allocated() / 1024 / 1024 / 1024} GB')
-    # print(f'Current GPU max cache usage: {torch.cuda.max_memory_cached() / 1024 / 1024 / 1024} GB')
+    print(f'Current GPU cache usage: {torch.cuda.memory_reserved() / 1024 / 1024 / 1024} GB')
+    print(f'Current GPU max memory usage: {torch.cuda.max_memory_allocated() / 1024 / 1024 / 1024} GB')
+    print(f'Current GPU max cache usage: {torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024} GB')
     # print(f'Numel in current tensors: {sum}')
     mem()
 
 
@@ -263,16 +263,11 @@ def train(train_loaders, test_loaders, model, device, config, args, rank):
     for epoch in range(starting_epoch, config['Training']['epochs']):
         start = time.time()
         run = start
-        if (args.profile):
-            if (epoch == 0):
-                torch.cuda.cudart().cudaProfilerStart()
-            torch.cuda.nvtx.range_push("Epoch Start")
-
         if (args.multigpu or args.multinode):
             dist.barrier()
-        
-        if (epoch == 5):
-            exit
+        if (epoch == 2):
+            # torch.cuda.cudart().cudaProfilerStart()
+            pass
 
         # training
         model.train()
@@ -297,8 +292,6 @@ def train(train_loaders, test_loaders, model, device, config, args, rank):
                 if is_padded: #Padding the globals to match padded graphs.
                     global_feats = torch.concatenate((global_feats, torch.zeros(1, len(global_feats[0])).to(device)))
                 load = time.time()
-                if (args.profile):
-                    torch.cuda.nvtx.range_push("Model Forward")
                 if (len(logits) == 0):
                     logits = model(graph, global_feats)
                     tlabels = label
@@ -309,9 +302,6 @@ def train(train_loaders, test_loaders, model, device, config, args, rank):
                     weights = torch.concatenate((weights, track[:,1]), dim=0)
                 batch_lengths.append(logits.shape[0] - 1)
 
-                if (args.profile):
-                    torch.cuda.nvtx.range_pop() # popping model forward
-
             if is_padded:
                 keepmask = torch.full_like(logits[:,0], True, dtype=torch.bool)
                 keepmask[batch_lengths] = False
@@ -350,15 +340,11 @@ def train(train_loaders, test_loaders, model, device, config, args, rank):
                 normalized_loss += label_loss
             loss = normalized_loss / len(unique_labels)
 
-            if (args.profile):
-                torch.cuda.nvtx.range_push("Model Backward")
+
             optimizer.zero_grad()
             loss.backward()
             optimizer.step()
             total_loss += loss.detach().cpu().item()
-
-            if (args.profile):
-                torch.cuda.nvtx.range_pop() # pop model backward
             ibatch += 1
             cumulative_times[0] += batch_start - run
             cumulative_times[1] += load - batch_start
@@ -380,10 +366,6 @@ def train(train_loaders, test_loaders, model, device, config, args, rank):
         labels = []
         weights = []
         model.eval()
-
-        if (args.profile):
-            torch.cuda.nvtx.range_push("Model Evaluation")
-
         with torch.no_grad():
             for loader in test_loaders:
                 for batch, label, track, global_feats in loader:
@@ -404,9 +386,6 @@ def train(train_loaders, test_loaders, model, device, config, args, rank):
         eval_end = time.time()
         cumulative_times[3] += eval_end - run
 
-        if (args.profile):
-            torch.cuda.nvtx.range_pop() # pop evaluation
-
         if scores == []: #If validation set is empty.
             continue
         logits = torch.concatenate(scores).to(device)
@@ -496,8 +475,6 @@ def train(train_loaders, test_loaders, model, device, config, args, rank):
 
                 try:
                     #test_auc = roc_auc_score(labels[wgt_mask].to("cpu") == 1, scores[wgt_mask].to("cpu"), multi_class='ovr', sample_weight=weights[wgt_mask].to("cpu"))
-                    if (len(scores[0]) != config["Model"]["args"]["out_size"]):
-                        print("ERROR: The out_size and the number of class labels don't match! Please check config.")
                     test_auc = roc_auc_score(labels_onehot[wgt_mask], scores[wgt_mask].to("cpu"), multi_class='ovr', sample_weight=weights[wgt_mask].to("cpu"))
                 except ValueError:
                     test_auc = np.nan
@@ -602,9 +579,6 @@ def train(train_loaders, test_loaders, model, device, config, args, rank):
             custom_scheduler.step(model, {'test_auc':test_auc})
         scheduler.step()
 
-        if (args.profile):
-            torch.cuda.nvtx.range_pop() # pop epoch
-
     print(f"Load: {cumulative_times[0]:.4f} s")
     print(f"Batch: {cumulative_times[1]:.4f} s")
     print(f"Train: {cumulative_times[2]:.4f} s")
@@ -704,7 +678,7 @@ def main(rank=0, args=None, world_size=1, port=24500, seed=12345):
         mask_fn = utils.fold_selection(fold_conf, "train")
         if args.preshuffle:
             # ldr = ldr_type(start_dataset=dset, batch_size=batch_size, mask_fn=mask_fn, suffix = utils.fold_selection_name(fold_conf, 'train'), chunks = shuffle_chunks, padding_mode = padding_mode, use_ddp = args.multigpu, rank=rank, world_size=world_size)
-            ldr = ldr_type(start_dataset=dset, batch_size=batch_size, mask_fn=mask_fn, suffix = utils.fold_selection_name(fold_conf, 'train'), chunks = shuffle_chunks, padding_mode = padding_mode, hidden_size = config["Model"]["args"]["hid_size"])
+            ldr = ldr_type(start_dataset=dset, batch_size=batch_size, mask_fn=mask_fn, suffix = utils.fold_selection_name(fold_conf, 'train'), chunks = shuffle_chunks, padding_mode = padding_mode)
             gsamp, _, _, global_samp = ldr[0]
             sampler = None
             
@@ -721,7 +695,7 @@ def main(rank=0, args=None, world_size=1, port=24500, seed=12345):
                 sampler = DistributedSampler(ldr, num_replicas=world_size, rank=pargs.global_rank, shuffle=False, drop_last=True)
             train_loaders.append(torch.utils.data.DataLoader(ldr, batch_size = None, num_workers = 0, sampler = sampler))
             sampler = None
-            ldr = ldr_type(start_dataset=dset, batch_size=batch_size, mask_fn=mask_fn, suffix = utils.fold_selection_name(fold_conf, 'test'), chunks = shuffle_chunks, padding_mode = padding_mode, hidden_size= config['Model']['args']['hid_size'])
+            ldr = ldr_type(start_dataset=dset, batch_size=batch_size, mask_fn=mask_fn, suffix = utils.fold_selection_name(fold_conf, 'test'), chunks = shuffle_chunks, padding_mode = padding_mode)
             if (args.multigpu):
                 sampler = DistributedSampler(ldr, num_replicas=world_size, rank=rank, shuffle=False, drop_last=True)
                 # num_batches = len(ldr)
@@ -733,10 +707,10 @@ def main(rank=0, args=None, world_size=1, port=24500, seed=12345):
 
             test_loaders.append(torch.utils.data.DataLoader(ldr, batch_size = None, num_workers = 0, sampler=sampler))
 
-            # if "validation" in fold_conf:
-            #     val_loaders.append(torch.utils.data.DataLoader((ldr_type(start_dataset=dset, batch_size=batch_size, mask_fn=utils.fold_selection(fold_conf, "validation"), suffix = utils.fold_selection_name(fold_conf, 'validation'), chunks = shuffle_chunks, hidden_size=config['Model']['args']['hid_size'],  padding_mode = padding_mode, rank=rank, world_size=1)), batch_size = None, num_workers = 0, sampler = sampler))
-            # else:
-            #     print("No validation set for dataset ", dset_conf)
+            if "validation" in fold_conf:
+                val_loaders.append(torch.utils.data.DataLoader((ldr_type(start_dataset=dset, batch_size=batch_size, mask_fn=utils.fold_selection(fold_conf, "validation"), suffix = utils.fold_selection_name(fold_conf, 'validation'), chunks = shuffle_chunks, padding_mode = padding_mode, rank=rank, world_size=1)), batch_size = None, num_workers = 0, sampler = sampler))
+            else:
+                print("No validation set for dataset ", dset_conf)
         else:
             train_loaders.append(datasets.GetBatchedLoader(dset, batch_size, utils.fold_selection(fold_conf, "train")))  
             gsamp, _, _, global_samp = dset[0]
@@ -750,8 +724,6 @@ def main(rank=0, args=None, world_size=1, port=24500, seed=12345):
     print("Load time: {:.4f} s".format(load_end - load_start))
 
     model = utils.buildFromConfig(config["Model"], {'sample_graph': gsamp, 'sample_global': global_samp, 'seed': seed}).to(device)
-    pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
-    print(f"Number of trainable parameters = {pytorch_total_params}")
     if not args.nocompile:
         model = torch.compile(model)
     if args.multigpu:
@@ -816,7 +788,6 @@ if __name__ == "__main__":
     add_arg("--directory", type=str, help="Append to Training Directory")
     add_arg("--seed", type=int, default=2, help="Sets random seed")
     add_arg("--abs", action="store_true", help="Use abs value of per-event weight")
-    add_arg("--profile", action="store_true", help="use nsight systems profiler")
 
     pargs = parser.parse_args()
     
diff --git a/root_gnn_dgl/setup/Dockerfile b/root_gnn_dgl/setup/Dockerfile
deleted file mode 100755
index 854db7c1043c98701a35b99f4fcd5cbfaa04c102..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/setup/Dockerfile
+++ /dev/null
@@ -1,22 +0,0 @@
-FROM nvcr.io/nvidia/dgl:25.05-py3
-
-WORKDIR /workspace
-
-LABEL maintainer.name="Joshua Ho"
-LABEL maintainer.email="ho22joshua@berkeley.edu"
-
-ENV LANG=C.UTF-8
-
-# System deps (with CA certs for HTTPS downloads)
-RUN apt-get update -qq \
- && apt-get install -y --no-install-recommends \
-    wget curl ca-certificates lsb-release gnupg software-properties-common \
-    vim \
-    g++-11 gcc-11 libstdc++-11-dev \
-    openmpi-bin openmpi-common libopenmpi-dev \
- && rm -rf /var/lib/apt/lists/*
-
-# Python packages
-RUN pip install --no-cache-dir mpi4py jupyter uproot
-
-EXPOSE 8888
\ No newline at end of file
diff --git a/root_gnn_dgl/setup/build_image.sh b/root_gnn_dgl/setup/build_image.sh
deleted file mode 100755
index aae10c14f375fb9427b6bc6564375383b3b571d1..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/setup/build_image.sh
+++ /dev/null
@@ -1,2 +0,0 @@
-podman-hpc build -t joshuaho/pytorch:1.0 --platform linux/amd64 .
-podman-hpc migrate joshuaho/pytorch:1.0
diff --git a/root_gnn_dgl/setup/environment.yml b/root_gnn_dgl/setup/environment.yml
index acf2a34ea3b5e72f1e4a4f32ea3b32856362ff75..d55c25d7b4d0f7b0b2f625c5dc47b4ed1bd47f12 100644
--- a/root_gnn_dgl/setup/environment.yml
+++ b/root_gnn_dgl/setup/environment.yml
@@ -1,4 +1,4 @@
-name: pytorch
+name: dgl
 channels:
   - pytorch
   - dglteam/label/cu118
@@ -387,4 +387,5 @@ dependencies:
       - triton==2.3.0
       - typing-extensions==4.11.0
       - tzdata==2024.1
-      - uproot==5.3.7
\ No newline at end of file
+      - uproot==5.3.7
+prefix: /global/homes/j/joshuaho/.conda/envs/dgl
diff --git a/root_gnn_dgl/setup/launch_image.sh b/root_gnn_dgl/setup/launch_image.sh
deleted file mode 100644
index 6f0b08dfc55d6577b883839ff18080028dbe0a06..0000000000000000000000000000000000000000
--- a/root_gnn_dgl/setup/launch_image.sh
+++ /dev/null
@@ -1,21 +0,0 @@
-#!/bin/bash
-
-ENTRYPOINT=$1
-shift
-ARGUEMENTS="$*"
-
-echo "launched image"
-echo "Entrypoint = $ENTRYPOINT"
-echo "Arguements = $ARGUEMENTS"
-
-podman-hpc run \
-  -it \
-  --mount type=bind,source=/pscratch/sd/j/joshuaho/,target=/pscratch/sd/j/joshuaho/ \
-  --mount type=bind,source=/global/cfs/projectdirs/atlas/joshua/,target=/global/cfs/projectdirs/atlas/joshua/ \
-  --rm \
-  --network host \
-  --gpu \
-  --shm-size=32g \
-  joshuaho/pytorch:1.0 \
-  $ENTRYPOINT \
-  $ARGUEMENTS
\ No newline at end of file
diff --git a/training_time.png b/training_time.png
deleted file mode 100644
index 5cb363d853a012b511c5944ae7ea3c6fa761297a..0000000000000000000000000000000000000000
--- a/training_time.png
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:eccc855e8c797c433903e13422bd3e6024270e9db25decd8cba1d2233fd4166a
-size 292604