Title: One Search Fits All: Pareto-Optimal Eco-Friendly Model Selection
URL Source: https://arxiv.org/html/2505.01468
Published Time: Tue, 06 May 2025 00:01:30 GMT
- 2 Related Work
  - Existing benchmarks
  - Eco-Aware NAS methods
- 3 GREEN
  - 3.1 Theoretical Foundations
  - 3.2 Inputs
  - 3.3 Predictive Model Learning
  - 3.4 Multi-Objective Optimization and Ranking for Best Model Selection
  - 3.5 Online Updates
- 5 Experiments
  - 5.1 Experimental Setup
  - 5.2 Evaluation Metrics
- A Technical Appendices and Supplementary Material
  - A.1 Overview of Datasets and Models
  - A.2 Knowledge Base Creation
  - A.3 Feature Extraction
- B Description of the Set-Based Order Value Alignment (SOVA) Metric
  - Boundedness of SOVA@k
  - B.1 Expanded definition of SOVA@k with potential ties in ranks
- C Experimental Setting
  - C.1 Hardware Specification
  - C.2 Standard metrics to assess quality of Pareto solutions
- D Competitors Details
  - ECNAS
  - CENAS
  - KNAS
- E Technical Addendum
  - E.1 Pareto Front Extraction and Filtering
  - E.2 Running Time and Time Complexity Analysis
  - E.3 Predictor Sanity Check
One Search Fits All: Pareto-Optimal Eco-Friendly Model Selection
Filippo Betello
DIAG
Sapienza University of Rome
Rome, Italy
Antonio Purificato∗
DIAG
Sapienza University of Rome
Rome, Italy
Vittoria Vineis∗
DIAG
Sapienza University of Rome
Rome, Italy
Gabriele Tolomei
Department of Computer Science
Sapienza University of Rome
Rome, Italy
Fabrizio Silvestri
DIAG
Sapienza University of Rome
Rome, Italy
∗Equal Contribution
Abstract
The environmental impact of Artificial Intelligence (AI) is emerging as a significant global concern, particularly regarding model training. In this paper, we introduce GREEN (Guided Recommendations of Energy-Efficient Networks), a novel, inference-time approach for recommending Pareto-optimal AI model configurations that optimize validation performance and energy consumption across diverse AI domains and tasks. Our approach directly addresses the limitations of current eco-efficient neural architecture search methods, which are often restricted to specific architectures or tasks. Central to this work is EcoTaskSet, a dataset comprising training dynamics from over 1767 experiments across computer vision, natural language processing, and recommendation systems using both widely used and cutting-edge architectures. Leveraging this dataset and a prediction model, our approach demonstrates effectiveness in selecting the best model configuration based on user preferences. Experimental results show that our method successfully identifies energy-efficient configurations while ensuring competitive performance.
1 Introduction
Artificial intelligence (AI) systems, while enabling advancements in numerous fields, come at a substantial computational and environmental cost. Training and inference for large-scale models, including Large Language Models (LLMs), require vast computational resources (e.g., 539 t CO₂-eq for the LLaMA 2 model (Touvron et al., 2023); we use the definition of CO₂-eq from the Environmental Protection Agency), resulting in considerable carbon emissions and raising urgent concerns amid global efforts to combat climate change (Bender et al., 2021; Faiz et al., 2024). While some models, such as DeepSeek (DeepSeek-AI, 2024), have attempted to employ new structures and more efficient resource utilization, the prevailing trend continues towards increasingly large and complex models. This reliance on scale worsens the issue, as the drive for performance often ignores its environmental costs (Wu et al., 2022; George et al., 2023).
Nonetheless, while increasing attention is paid to the environmental impact of the training and deployment phases, the energy costs of AI actually begin earlier, at the model selection and optimization stage. This phase, often underreported, involves extensive experimentation to identify the optimal model configuration (throughout this paper, we refer to a model configuration as a specific combination of neural architecture and training-related parameters, namely batch size and learning rate), contributing to a significant share of the overall energy footprint (Vente et al., 2024). Developing methods that can predict energy-efficient configurations before training begins would therefore not only reduce emissions and computational overhead but also shorten the model selection process. Moreover, at a higher technical level, current approaches to eco-efficient Neural Architecture Search (NAS) still face the same challenges as traditional NAS: they are computationally expensive (Strubell et al., 2020) and often tailored to specific datasets or architectures, limiting their generalization to diverse tasks and domains (Liu et al., 2022).
Recent efforts have focused on mitigating this impact by optimizing hardware usage (Chung et al., 2024; You et al., 2023) and reducing the search space (Guo et al., 2020). For example, EC-NAS (Bakhtiarifard et al., 2024) extends this by optimizing both accuracy and energy consumption for image classification, but it is limited to predefined layer types. CE-NAS (Zhao et al., 2024) leverages reinforcement learning to optimize NAS algorithms based on GPU availability, but similarly restricts the search to a narrow set of layer types. To support these efforts, benchmarks like NAS-Bench-101 (Ying et al., 2019) have been proposed to enable energy-focused NAS evaluations.
Given these constraints, it would be highly beneficial to predict a model’s performance in terms of accuracy and energy consumption before execution. For instance, consider a scenario where a large-scale Neural Network (NN) for image classification requires dozens of experiments to fine-tune the number of layers, learning rate, and regularization methods. Predicting an optimal configuration upfront could eliminate the need for extensive trial-and-error runs, saving hundreds of GPU hours and avoiding significant CO₂-eq emissions.
This paper introduces a novel method named GREEN (Guided Recommendations of Energy-Efficient Networks) that recommends Pareto-optimal NN configurations balancing expected performance on a validation set and energy consumption for any dataset and task across three distinct AI domains, namely computer vision, natural language processing (NLP), and recommendation systems. Crucially, this process operates entirely at inference time. Unlike existing approaches in energy-efficient multi-objective NAS, our method is highly flexible and extensible across multiple domains. It can be extended to any number and type of objectives, architectures, and datasets in the aforementioned domains, making it suitable for diverse applications. From an implementation standpoint, our approach leverages a custom multi-domain knowledge base, EcoTaskSet, constructed from over 1767 NN training processes.
Overall, the main contributions of our work can be summarized as follows:
(1) We introduce GREEN (the anonymous code is available here), a new method that provides multi-objective Pareto-optimal solutions for selecting the best model configurations entirely at inference time and that differs from the current literature by being extensible to any number and type of objectives, architectures, and datasets. The overall approach is depicted in Fig. 1.
(2) We create and release to the community EcoTaskSet (we release the anonymised dataset here), a dataset capturing neural network training dynamics across three domains: computer vision, natural language processing, and recommendation systems. It includes both well-established and cutting-edge, ready-to-use neural architectures that are widely adopted in real-world application scenarios. Moreover, unlike existing benchmarks, it provides detailed epoch-level metrics on both validation performance and energy consumption, offering a valuable resource for research in eco-efficient machine learning and the study of deep learning training dynamics.
(3) We introduce SOVA (Set-Based Order Value Alignment), a new ranking alignment metric designed to evaluate the alignment of true multi-objective metric values across two ranked sets.
(4) Extensive experiments demonstrate that GREEN successfully identifies energy-efficient configurations while maintaining competitive performance metrics.
Figure 1: An overview of GREEN. It takes as input features from EcoTaskSet. Then GREEN identifies energy-efficient configurations while maintaining competitive performance metrics. The output is a set of Pareto-optimal model configurations, which can be ranked according to user preferences to suggest the single best model configuration for a specific dataset, task, and computational infrastructure.
2 Related Work
Widely used NAS algorithms like DARTS (Liu et al., 2018) and Efficient NAS (Elsken et al., 2018) are known for being highly CO₂-intensive (Strubell et al., 2020). Recent studies have explored ways to mitigate this environmental impact, either by optimizing hardware usage (Chung et al., 2024; You et al., 2023) or by reducing the search space to make NAS more efficient (Guo et al., 2020). In parallel, several benchmarks (Dou et al., 2023; Bakhtiarifard et al., 2024; Wang et al., 2020) and frameworks for efficient NAS have been introduced (Zhao et al., 2024).
Existing benchmarks.
A wide range of benchmark datasets have been created to facilitate research in architecture search and efficiency-aware learning. However, these datasets often operate under restrictive design choices that limit their applicability to real-world scenarios. The NAS-Bench family of datasets (NAS-Bench-101 (Ying et al., 2019), NAS-Bench-201 (Dong and Yang, 2020), and NAS-Bench-301 (Zela et al., 2020)) defines fixed, low-complexity search spaces over small-scale convolutional architectures, typically on datasets such as CIFAR-10 and CIFAR-100. NAS-Bench-101 supports only three operation types and a constrained graph topology. NAS-Bench-201 marginally expands the space to support different datasets but still excludes transformers and other contemporary architectures. NAS-Bench-301 introduces a larger search space, but relies on surrogate models trained on partial data, introducing approximation artifacts that reduce reliability, especially under distribution shifts. Crucially, none of these benchmarks provide per-epoch energy measurements. General-purpose performance prediction benchmarks such as LCBench (Zimmer et al., 2021) and Taskset (Metz et al., 2020) offer broader task coverage but similarly abstract away the training process. LCBench focuses on tabular datasets and shallow MLP architectures, providing only scalar performance metrics and metadata under fixed hyperparameters. Taskset focuses exclusively on RNN models and NLP tasks. Importantly, neither benchmark includes energy consumption tracking or system-level resource information.
These benchmarks, while useful within their respective domains, fall short of supporting sustainability-focused research or enabling fine-grained study of training behavior across architectures and domains.
Eco-Aware NAS methods.
In parallel, recent methods have extended NAS algorithms to incorporate energy or hardware awareness. For instance, Bakhtiarifard et al. (2024) proposed EC-NAS, a benchmark focused on energy-aware NAS for image classification, built upon the foundational NAS-Bench-101 dataset (Ying et al., 2019). EC-NAS enables multi-objective NAS by identifying models that balance energy consumption and accuracy. It outputs a Pareto frontier of optimal models, but restricts architectural choices to a predefined set of layers (e.g., 3x3 convolution, 1x1 convolution, 3x3 max pooling). However, EC-NAS has notable limitations: it reports performance metrics only at a few predefined epochs, inherits the restricted architectural diversity of NAS-Bench-101, and assumes a fixed threshold budget during search, which limits flexibility in exploring trade-offs between energy and performance. Zhao et al. (2024) introduced CE-NAS, a framework that builds on existing benchmarks (Dong and Yang, 2020; Siems et al., 2020) and optimizes NAS architecture selection based on GPU availability. They proposed a reinforcement learning-based policy to allocate NAS algorithms across clusters with multiple GPUs. However, CE-NAS also restricts its search space to a small set of layer types. Xu et al. (2021) proposed KNAS, a gradient-based method that is able to evaluate randomly initialized networks. It achieves large speed-ups on NAS-Bench-201 benchmarks (Dong and Yang, 2020); however, like the previously discussed approaches, KNAS is constrained by the limited architectural diversity of the underlying benchmark.
To address these gaps, we present EcoTaskSet, a benchmark dataset, and GREEN, a method for jointly recommending energy-efficient configurations—spanning neural architecture, training budget, and key hyperparameters—based on realistic, domain-diverse training runs.
3 GREEN
In this study, given a task and a dataset, we address the problem of selecting the optimal model configuration while taking into account the user's preferences regarding the trade-off between performance and environmental impact. From a practical viewpoint, we claim that the task of identifying the optimal model configuration for a given machine learning problem can be approached as a combined learning and optimization challenge. For this reason, we propose a solution based on a two-step approach. The first step involves a learning and prediction phase, which leverages a cross-domain knowledge base (EcoTaskSet). The second step encompasses multi-objective optimization and preference-based ranking to select the optimal model configurations for the given task and dataset. Owing to space limitations, the computational complexity analysis of the algorithm is deferred to Appendix E, while the remainder of this section focuses on the theoretical foundations and formalization of the proposed solution.
3.1 Theoretical Foundations
Assumption 3.1 (Non-linear Relationship between Performance and Energy).
Let $I$ be a given computational infrastructure. For a neural network model $M$ trained on task $T$ with dataset $D$, we denote by $A_e$ the performance on the validation set and by $E_e$ the energy consumption at epoch $e$. We assume there exists a non-linear function

$$A_e, E_e = f(\phi_T, \phi_D, \phi_M, \phi_I, \theta, e),$$

where $\phi_T$ describes the task, $\phi_D$ the dataset, $\phi_M$ the model configuration, $\phi_I$ the infrastructure characteristics, and $\theta$ the training hyperparameters. This expresses that both $A_e$ and $E_e$ depend on these features in a complex, non-linear way.
Hypothesis 1 (Sufficiency of Feature Descriptors).
Under Assumption 3.1, a sufficiently rich set of descriptive features $(\phi_T, \phi_D, \phi_M, \phi_I)$ allows us to approximate, with small error, the variation in energy consumption and validation performance across different epochs and configurations.
Hypothesis 2 (Neural Network as Universal Approximator).
We posit that the function $f$ in Assumption 3.1 can be effectively approximated by a neural network, leveraging cross-domain knowledge and the interplay among the various features. Furthermore, such a neural network is capable of generalizing to new epochs and novel model configurations, thus providing a flexible framework for predicting energy consumption and performance.
Remark 3.2.
We emphasize that $E$ is our main focus because carbon intensity, which is used to estimate carbon emissions, is determined by the energy mix of the geographical location where the computation occurs and acts as a multiplicative factor on $E$. As shown in Faiz et al. (2024), the carbon footprint of a computational process directly correlates with the carbon intensity of the electricity used. However, since carbon intensity is independent of the energy consumed by a computational process, we prefer to focus directly on $E$. This choice allows us to align more directly with the existing literature on energy-efficient computation, placing our work within a broad line of research aimed at reducing the overall energy footprint of machine learning.
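The multiplicative relationship noted above can be made concrete with a short, hedged sketch; the carbon-intensity values and the energy figure below are illustrative placeholders, not measurements from the paper:

```python
# Emissions scale linearly with carbon intensity for a fixed energy consumption E.
# All numeric values here are illustrative, not data from the paper.
def carbon_emissions_kg(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """CO2-eq (kg) = energy (kWh) * carbon intensity (g CO2-eq / kWh) / 1000."""
    return energy_kwh * intensity_g_per_kwh / 1000.0

training_run_kwh = 120.0                                      # hypothetical training run
low_carbon = carbon_emissions_kg(training_run_kwh, 50.0)      # e.g., hydro-heavy grid
high_carbon = carbon_emissions_kg(training_run_kwh, 700.0)    # e.g., coal-heavy grid
print(low_carbon, high_carbon)  # 6.0 84.0
```

The same job's footprint varies by more than an order of magnitude depending on location, which is why optimizing $E$ directly, and leaving carbon intensity as an external multiplier, keeps the method location-independent.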
3.2 Inputs
Given what we assume and claim in Section 3.1, we define the input space $\mathcal{X} = \{\mathcal{M}, \mathcal{T}, \mathcal{D}, \mathcal{I}\}$, where $\mathcal{M}$ denotes a set of NN model configurations, $\mathcal{T}$ a set of tasks, $\mathcal{D}$ a set of datasets, and $\mathcal{I}$ a set of computational infrastructures. This representation allows any machine learning problem to be expressed as a tuple $(M, T, D, I) \in \mathcal{M} \times \mathcal{T} \times \mathcal{D} \times \mathcal{I}$, where each combination encapsulates the interactions between the model, task, dataset, and computational environment, collectively influencing system performance and resource consumption. For each set $\mathcal{X}_k \in \mathcal{X}$, we define:

$$\mathcal{X}_k = \{X_{k,i}\}_{i=1}^{|\mathcal{X}_k|} \quad\text{and}\quad \phi(X_{k,i}) = \{x_{k,i}^{j}\}_{j=1}^{|\phi(X_{k,i})|},$$

where each element $X_{k,i}$ is represented by a feature vector $\phi(X_{k,i}) \in \Phi_{\mathcal{X}_k}$, with $\Phi_{\mathcal{X}_k}$ denoting the feature space associated with the set $\mathcal{X}_k$. Notably, these feature spaces are theoretically unbounded, allowing for infinite variability in configurations, domains, and data sources. Each feature $x_{k,i}^{j}$ then corresponds to a measurable property of $X_{k,i}$. It should be noted that, despite being formally associated with a specific set for modeling reasons, some features in practice span multiple dimensions. For instance, the number of floating-point operations for a model $M$ depends on both its architecture and the dataset characteristics.
The detailed design of these feature spaces and their associated feature sets is given in Section A.2.
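To make the abstract tuple $(M, T, D, I)$ concrete, here is a minimal, hypothetical illustration; the feature names are invented for exposition and do not correspond to the paper's actual feature set (which is detailed in Section A.2):

```python
# Hypothetical sketch of one machine learning problem as a tuple (M, T, D, I),
# each element carrying its own feature set phi(X_{k,i}). Names are illustrative.
problem = {
    "model":          {"architecture": "resnet18", "batch_size": 128, "learning_rate": 1e-3},
    "task":           {"type": "image_classification", "num_classes": 10},
    "dataset":        {"num_samples": 50_000, "channels": 3, "resolution": 32},
    "infrastructure": {"gpu": "A100", "num_gpus": 1},
}

def flatten_features(problem: dict) -> list:
    """Concatenate the per-set feature values into a single predictor input."""
    return [v for group in problem.values() for v in group.values()]

features = flatten_features(problem)
assert len(features) == 10  # 3 model + 2 task + 3 dataset + 2 infrastructure
```

Some features (e.g., FLOPs) would in practice be derived from several of these groups at once, as noted above.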
3.3 Predictive Model Learning
In the first step of our approach, we aim to construct a predictive function $q_\theta$ that approximates the function $f$ introduced in Section 3.1, i.e., that estimates the validation performance $A$ and energy consumption $E$ of a model $M$ at epoch $e$, for a given task $T$, dataset $D$, and computational infrastructure $I$. Since Assumption 3.1 models $A$ and $E$ as non-linear functions of task features ($\phi_T$), dataset features ($\phi_D$), model configuration features ($\phi_M$), training hyperparameters ($\theta$), the epoch ($e$), and the computational infrastructure ($I$), we formally define:

$$q_\theta : \mathcal{X} \to \mathcal{Y}, \quad \mathcal{X} = \Phi_T \times \Phi_D \times \Phi_M \times \Theta \times \mathbb{N} \times \mathcal{I}, \quad \mathcal{Y} = \mathbb{R} \times \mathbb{R}_{\geq 0},$$

such that $q_\theta(\phi_T, \phi_D, \phi_M, \phi_I, \theta, e) = (A_e, E_e)$, where $\Theta$ is the space of training hyperparameters for $q_\theta$ and $\mathbb{N}$ is the epoch space. Here, $A \in \mathbb{R}$ is a task-dependent performance metric (e.g., accuracy for classification tasks or mean squared error for regression tasks) and $E \in \mathbb{R}_{\geq 0}$ represents an environmental impact metric, in our case the energy consumption. To simplify notation, we henceforth write $q_\theta(\Phi, \theta, e)$, where $\Phi = \Phi_T \cup \Phi_D \cup \Phi_M \cup \Phi_I$.
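As a sketch of what such a predictor could look like (a stand-in under stated assumptions, not the paper's actual architecture), consider a small MLP mapping a concatenated feature vector to the pair $(\hat{A}_e, \hat{E}_e)$, with a softplus on the energy head so the output respects $\mathcal{Y} = \mathbb{R} \times \mathbb{R}_{\geq 0}$:

```python
import numpy as np

# Minimal stand-in for q_theta: a two-layer MLP over the concatenated features
# [phi_T, phi_D, phi_M, phi_I, theta, e]. Dimensions and initialization are hypothetical.
rng = np.random.default_rng(0)

def init_predictor(in_dim: int, hidden: int = 64) -> dict:
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, 2)),      "b2": np.zeros(2),
    }

def q_theta(params: dict, phi: np.ndarray) -> tuple:
    h = np.tanh(phi @ params["W1"] + params["b1"])
    a_raw, e_raw = h @ params["W2"] + params["b2"]
    # A_e lives in R; E_e is mapped into R_{>=0} via softplus.
    return float(a_raw), float(np.log1p(np.exp(e_raw)))

params = init_predictor(in_dim=16)
phi = rng.normal(size=16)        # stand-in for the concatenated feature vector
A_hat, E_hat = q_theta(params, phi)
assert E_hat >= 0.0
```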
To determine the parameters θ∈Θ 𝜃 Θ\theta\in\Theta italic_θ ∈ roman_Θ, we minimize a step-wise weighted loss function that balances the prediction of performance (A 𝐴 A italic_A) and energy consumption (E 𝐸 E italic_E) over a sequence of training epochs. The optimization objective is:
θ∗=argmin θ∈Θ𝔼(Φ,θ,e)∼p[ℒ(q θ(Φ,θ,e),(A e,E e),α e)],superscript 𝜃 subscript 𝜃 Θ subscript 𝔼 similar-to Φ 𝜃 𝑒 𝑝 delimited-[]ℒ subscript 𝑞 𝜃 Φ 𝜃 𝑒 subscript 𝐴 𝑒 subscript 𝐸 𝑒 subscript 𝛼 𝑒\theta^{*}=\arg\min_{\theta\in\Theta}\mathbb{E}{(\Phi,\theta,e)\sim p}\left[% \mathcal{L}\bigl{(}q{\theta}(\Phi,\theta,e),(A_{e},E_{e}),\alpha_{e}\bigr{)}% \right],italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_arg roman_min start_POSTSUBSCRIPT italic_θ ∈ roman_Θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT ( roman_Φ , italic_θ , italic_e ) ∼ italic_p end_POSTSUBSCRIPT [ caligraphic_L ( italic_q start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( roman_Φ , italic_θ , italic_e ) , ( italic_A start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) , italic_α start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) ] ,
where ℒ ℒ\mathcal{L}caligraphic_L is the composite loss function for a given epoch e 𝑒 e italic_e and uses the Mean Absolute Error (MAE) as the base metric and α e∈[0,1]subscript 𝛼 𝑒 0 1\alpha_{e}\in[0,1]italic_α start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ∈ [ 0 , 1 ] is the weight of the energy-related loss component. For predicted values A^e subscript^𝐴 𝑒\hat{A}{e}over^ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT and E^e subscript^𝐸 𝑒\hat{E}{e}over^ start_ARG italic_E end_ARG start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT, and true values A e subscript 𝐴 𝑒 A_{e}italic_A start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT and E e subscript 𝐸 𝑒 E_{e}italic_E start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT, the step-wise MAE losses are computed as:
ℒ A,e=1 B∑i=1 B|A^e(i)−A e(i)|,ℒ E,e=1 B∑i=1 B|E^e(i)−E e(i)|,formulae-sequence subscript ℒ 𝐴 𝑒 1 𝐵 superscript subscript 𝑖 1 𝐵 superscript subscript^𝐴 𝑒 𝑖 superscript subscript 𝐴 𝑒 𝑖 subscript ℒ 𝐸 𝑒 1 𝐵 superscript subscript 𝑖 1 𝐵 superscript subscript^𝐸 𝑒 𝑖 superscript subscript 𝐸 𝑒 𝑖\mathcal{L}{A,e}=\frac{1}{B}\sum{i=1}^{B}|\hat{A}{e}^{(i)}-A{e}^{(i)}|,% \quad\mathcal{L}{E,e}=\frac{1}{B}\sum{i=1}^{B}|\hat{E}{e}^{(i)}-E{e}^{(i)}|,caligraphic_L start_POSTSUBSCRIPT italic_A , italic_e end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_B end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT | over^ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT - italic_A start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | , caligraphic_L start_POSTSUBSCRIPT italic_E , italic_e end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_B end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT | over^ start_ARG italic_E end_ARG start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT - italic_E start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | ,
where B 𝐵 B italic_B is the batch size, and e∈{1,…,V}𝑒 1…𝑉 e\in{1,\ldots,V}italic_e ∈ { 1 , … , italic_V }, with V 𝑉 V italic_V being the maximum number of epochs. The composite loss at each epoch is given by:
ℒ comp,e=α eℒ A,e+(1−α e)ℒ E,e,subscript ℒ comp 𝑒 subscript 𝛼 𝑒 subscript ℒ 𝐴 𝑒 1 subscript 𝛼 𝑒 subscript ℒ 𝐸 𝑒\mathcal{L}{\text{comp},e}=\alpha{e}\mathcal{L}{A,e}+(1-\alpha{e})\mathcal% {L}_{E,e},caligraphic_L start_POSTSUBSCRIPT comp , italic_e end_POSTSUBSCRIPT = italic_α start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_A , italic_e end_POSTSUBSCRIPT + ( 1 - italic_α start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) caligraphic_L start_POSTSUBSCRIPT italic_E , italic_e end_POSTSUBSCRIPT ,
where the dynamic weights $\alpha_e$ are computed from the relative rates of change of the individual losses. First, we calculate the rates of change and normalize them to obtain the weight $\alpha_e$ for the loss $\mathcal{L}_{A,e}$:

$$r_{A,e}=\frac{\mathcal{L}_{A,e}}{\mathcal{L}_{A,e-1}},\quad r_{E,e}=\frac{\mathcal{L}_{E,e}}{\mathcal{L}_{E,e-1}},\quad \alpha_e=\frac{r_{A,e}}{r_{A,e}+r_{E,e}}.$$

For the initial epochs ($e<2$), where sufficient history is unavailable, equal weights are assigned: $\alpha_e=0.5$. The overall loss for the training process is then computed as the average of the composite losses over all epochs:

$$\mathcal{L}=\frac{1}{V}\sum_{e=1}^{V}\mathcal{L}_{\text{comp},e}.$$
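As a minimal sketch, the weighting rule above can be computed directly from the two loss histories (the function name and list-based interface are ours, purely for illustration, not from the paper's code):

```python
def dynamic_alpha(loss_A, loss_E):
    """Compute per-epoch weights alpha_e from the relative rates of change
    of two loss histories (lists indexed by epoch, as in the paper).

    For e < 2 there is not enough history, so equal weights (0.5) are used.
    """
    alphas = []
    for e in range(len(loss_A)):
        if e < 2:
            alphas.append(0.5)  # insufficient history: equal weighting
        else:
            r_A = loss_A[e] / loss_A[e - 1]  # rate of change of accuracy loss
            r_E = loss_E[e] / loss_E[e - 1]  # rate of change of energy loss
            alphas.append(r_A / (r_A + r_E))  # normalized weight for L_A
    return alphas
```

The loss that shrinks more slowly (larger ratio) receives the larger weight, so the optimizer keeps pressure on whichever objective is currently lagging.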
3.4 Multi-Objective Optimization and Ranking for Best Model Selection
Once $q_\theta$ has been learned, the next step is to identify the optimal model configuration, denoted $(M^*, e^*)$, that satisfies user-defined preferences for the trade-off between performance ($\omega_A$) and energy consumption ($\omega_E$).

A naive strategy is to pick the model configuration and epoch that minimize the predicted energy consumption $\hat{E}$ subject to a user-set performance constraint $\hat{A}\geq\gamma$, where $\gamma$ is the minimum performance level required by the user. Although effective at optimizing energy consumption while meeting a fixed performance threshold, such an approach inherently prioritizes one objective over the other and fails to account for the trade-offs between performance and energy consumption. To address this limitation, we formulate the task as a multi-objective optimization problem, aiming to simultaneously maximize $\hat{A}$ and minimize $\hat{E}$. The optimization proceeds in the two stages reported below.
3.4.1 Pareto Frontier Identification
In the first stage, we compute the Pareto frontier (Pareto, 1964), which identifies all non-dominated solutions, i.e., those for which no other configuration achieves better performance with lower energy consumption. Mathematically, a solution $(M_i, e_j)$ is Pareto-optimal if there exists no other solution $(M_{i'}, e_{j'})$ such that:

$$\hat{A}(M_{i'},e_{j'}) \geq \hat{A}(M_i,e_j), \quad \hat{E}(M_{i'},e_{j'}) \leq \hat{E}(M_i,e_j),$$

with at least one strict inequality. Constructing the Pareto frontier $\mathcal{P}$ reduces the search space to the configurations that represent the best trade-offs between $\hat{A}$ and $\hat{E}$. The procedure we use to identify the Pareto frontier is presented in Appendix E.
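The non-dominated filtering can be sketched as a simple quadratic-time scan over predicted (accuracy, energy) pairs; this is a reference implementation of the dominance definition, not the paper's actual extraction procedure (which is detailed in Appendix E):

```python
def pareto_front(points):
    """Return indices of non-dominated (A_hat, E_hat) points:
    maximize predicted accuracy, minimize predicted energy."""
    front = []
    for i, (a_i, e_i) in enumerate(points):
        dominated = any(
            # (a_j, e_j) dominates (a_i, e_i): at least as good on both
            # objectives, strictly better on at least one
            (a_j >= a_i and e_j <= e_i) and (a_j > a_i or e_j < e_i)
            for j, (a_j, e_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

For example, a configuration with accuracy 0.85 and energy 0.6 is dominated by one with accuracy 0.9 and energy 0.5, so only the latter survives the filter.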
3.4.2 Preference-Based Filtering and Ranking
In the second stage, we employ a preference-based filtering and ranking method to select one or more solutions from the Pareto frontier $\mathcal{P}$, based on user-defined preferences. This approach enables tailored decision-making by allowing users to define specific selection criteria. For instance, solutions can be filtered by a minimum performance threshold $\gamma$, ensuring that only configurations meeting user-specified baseline requirements are considered. If a single solution must be selected, various ranking methods can be applied to capture different prioritization strategies, such as proximity to optimal outcomes (e.g., distance to the ideal point) or user-defined preferences regarding the relative importance of one metric over another. In this context, user-defined weights $(\omega_A, \omega_E)$, with $\omega_A + \omega_E = 1$, represent the trade-off between validation performance and energy consumption. The score of a given configuration is then defined as:
$$S(M,e)=\omega_A \hat{A}_e - \omega_E \hat{E}_e.$$

The optimal solution is then:

$$(M^*, e^*)=\operatorname*{arg\,max}_{(M,e)\in\mathcal{P}} S(M,e).$$
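Assuming each Pareto-front entry carries its model identifier, epoch, and predicted objective values, the preference-based selection reduces to a weighted argmax; the tuple layout below is our assumption for illustration:

```python
def select_best(front, w_A, w_E=None):
    """Rank Pareto-front entries (model, epoch, A_hat, E_hat) by the score
    S = w_A * A_hat - w_E * E_hat and return the top configuration."""
    if w_E is None:
        w_E = 1.0 - w_A  # weights sum to one, per the paper's convention
    return max(front, key=lambda cfg: w_A * cfg[2] - w_E * cfg[3])
```

With a balanced preference (w_A = 0.5) a cheap, slightly less accurate configuration may win; shifting weight toward accuracy (w_A close to 1) flips the choice toward the higher-accuracy, more energy-hungry one.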
3.5 Online Updates
Since the predictive accuracy of $q_\theta$ is critical for robust recommendations, in a real-world scenario $q_\theta$ must adapt to evolving task, dataset, and model spaces to remain effective. To achieve this, the parameters $\theta$ can be refined in an online learning fashion. For the selected model configuration $M_e^* = (M^*, e^*)$, actually trained on dataset $D_i$ to solve task $T_i$ with computational infrastructure $I_i$, the update rule is:

$$\theta \leftarrow \theta - \eta \sum_{\tilde{e}=0}^{e^*} \nabla_{\theta}\, \mathcal{L}\left(q_{\theta}(M_{\tilde{e}}^{*}),\, \left(P(M_{\tilde{e}}^{*}), E(M_{\tilde{e}}^{*})\right),\, \alpha_{\tilde{e}}\right),$$

where $q_{\theta}(M_{\tilde{e}}^{*}) = q_{\theta}(\phi_{T_i}, \phi_{D_i}, \phi_{M^*}, \phi_{I_i}, \theta, \tilde{e})$, $\eta$ is the learning rate, and $\tilde{e} \in [0, e^*]$ spans the epochs from the initial one to $e^*$. Lastly, $P(M^*, \tilde{e})$ and $E(M^*, \tilde{e})$ are the performance and energy metrics observed during the actual training of the suggested model configuration.
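To make the update concrete, here is a toy sketch that substitutes a two-head linear predictor for the paper's transformer-based $q_\theta$ and uses an $\alpha_e$-weighted squared-error composite loss; the model form, loss, and all names here are illustrative assumptions, not the paper's implementation:

```python
def online_update(theta_P, theta_E, feats, targets, alphas, eta=0.01):
    """One online refinement step for a toy two-head linear predictor
    (one weight vector per target). feats[e] is the feature vector at
    epoch e; targets[e] = (P_e, E_e) are the observed performance and
    energy; alphas[e] weights the two squared errors, as in the paper's
    composite loss. Gradients are accumulated over epochs 0..e*, then
    applied in a single step."""
    gP = [0.0] * len(theta_P)
    gE = [0.0] * len(theta_E)
    for e, x in enumerate(feats):
        pred_P = sum(t * xi for t, xi in zip(theta_P, x))
        pred_E = sum(t * xi for t, xi in zip(theta_E, x))
        P, E = targets[e]
        a = alphas[e]
        for d, xi in enumerate(x):
            gP[d] += 2 * a * (pred_P - P) * xi        # grad of a*(pred_P-P)^2
            gE[d] += 2 * (1 - a) * (pred_E - E) * xi  # grad of (1-a)*(pred_E-E)^2
    theta_P = [t - eta * g for t, g in zip(theta_P, gP)]
    theta_E = [t - eta * g for t, g in zip(theta_E, gE)]
    return theta_P, theta_E
```

In practice the same accumulate-then-step pattern applies to the actual network: the observed $(P, E)$ trajectory of the recommended configuration becomes fresh supervision for $q_\theta$.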
4 EcoTaskSet
Unlike prior benchmarks that operate in synthetic or narrow domains, EcoTaskSet is built from diversified training runs across three major areas of machine learning practice: computer vision, natural language processing, and recommendation systems. For each run, we log per-epoch validation accuracy, energy consumption, and system-level details. The models, datasets, and tasks used to create this knowledge base (KB) are listed in Table 1 and described in detail in Section A.1. Selected well-known and established model architectures are trained for a domain-dependent number of epochs, using three different learning rates and five batch-size values to account for variability in optimization dynamics. To also investigate the influence of dataset size on training dynamics, we remove varying percentages of samples from the data, ensuring an equal proportion from each class. To track the energy consumption of all the experiments we use CodeCarbon (Courty et al., 2023), a tool designed to track the power consumption of both CPUs and GPUs, as well as additional metrics such as CO2-eq and total energy consumed. From all the samples, we extract the key information that forms the features of our dataset, as described in Section 3.2: hyperparameters, infrastructural features, task features, data features, and model features.
Comprehensive details of the model configurations used to build the KB and the features of our dataset are provided in Section A.2. Each sample in the dataset has two features that we treat as the targets for GREEN: the validation metric at the selected epoch (i.e., accuracy for image classification or F1-score for text classification) and the energy consumption at the same epoch, computed via CodeCarbon. The dataset contains 1767 samples.
| Domain | Task | Dataset | Model |
|---|---|---|---|
| Computer Vision | Classification | FOOD101, MNIST, Fashion-MNIST, CIFAR-10 | AlexNet, EfficientNet, ResNet18, SqueezeNet, ViT, VGG16 |
| NLP | Q&A, Sentiment Analysis | Google-boolq, StanfordNLP-IMDB, Dair-ai/Emotions, Rotten_tomatoes | RoBERTa, BERT, Microsoft-PHI-2, Mistral-7B |
| Recommendation Systems | Sequential Recommendation | FS-NYC, ML-100k, ML-1M, FS-TKY | Bert4Rec, GRU4Rec, CORE, SASRec |
Table 1: Overview of the Knowledge Base used to train $q_\theta$. EcoTaskSet was created by selecting 3 different domains, for a total of 1767 experiments. The underlined datasets are used for testing. A detailed description of the datasets and models can be found in Section A.1.
5 Experiments
5.1 Experimental Setup
We implement the predictive function $q_\theta$ as a transformer-based neural network with 4 transformer encoder layers, each with a model dimension of 256 and 8 attention heads. The feed-forward network within each encoder layer has a dimension of 512. The network is specifically designed for multivariate time-series inputs and produces multi-target predictions. The hyperparameter configuration was derived through a systematic hyperparameter optimization (HPO) process, specifically designed to minimize the mean absolute error (MAE) between the predicted metrics and the ground-truth values. To validate the correctness of our approach, we conduct experiments on different machines, performing multiple runs and reporting the average results. Details about the hardware configurations and HPO settings are provided in Section C.1.
We use the CIFAR-10, Foursquare-TKY (hereafter, FS-TKY), and Rotten_tomatoes datasets for testing, while the others are used to train our predictor model (Table 1). In our testing setup, we evaluate the results of three independent training runs of the aforementioned predictor model, each initialized with a different random seed. The minimum validation-accuracy threshold used to filter the data and construct the Pareto fronts is set to 0.9 for CIFAR-10 and FS-TKY, and 0.45 for Rotten_tomatoes. The lower threshold for Rotten_tomatoes is due to the limited training dynamics tracked in the NLP experiments, which cover only 5 epochs, resulting in model configurations that, on average, achieve lower validation performance. We compare our approach against different baselines, further described in Appendix D.
5.2 Evaluation Metrics
The evaluation of our approach is twofold: (i) assessing the accuracy of the predictor model and (ii) evaluating the alignment between the predicted Pareto front and the true Pareto front (i.e., based on the ground-truth values of the target metrics). First, we evaluate the accuracy of the learned function $q_\theta$ in predicting the two target metrics: validation accuracy and cumulative energy consumption (the cumulative energy target is normalized to the range $[0,1]$, like the validation accuracy, to ensure comparability and stability across different datasets and tasks). We assess these predictions at each training epoch using MAE. Second, we assess the alignment between the predicted Pareto fronts, $\mathcal{P}_{\text{pred}}$, and the true Pareto fronts, $\mathcal{P}_{\text{true}}$, using two of the most widely adopted metrics for evaluating solution sets in multi-objective optimization (Li and Yao, 2019): the Hausdorff distance (HaD) (Henrikson, 1999; Schutze et al., 2012) and the Hypervolume difference ($\Delta HV$) (Zitzler and Thiele, 1998). While HaD measures the maximum distance between the nearest points in the two sets, providing a robust indication of how closely the predicted front approximates the true front, $\Delta HV$ captures the difference in the dominated space, offering insight into the extent to which the predicted Pareto front covers the true one. Additionally, we employ standard classification metrics, such as Recall and F1-score, to assess the effectiveness of our approach in identifying relevant solutions.
Lastly, since our ultimate goal is to recommend the optimal model configuration based on the problem setup and user preferences, we evaluate the accuracy of ranking configurations using the Normalized Discounted Cumulative Gain (NDCG) (Wang et al.,, 2013). Due to space constraints, we define all aforementioned metrics in Section C.2.
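For instance, the symmetric Hausdorff distance between two fronts follows directly from its definition, assuming each front is a list of points in the normalized $(A, E)$ objective space (a reference sketch, not the paper's evaluation code):

```python
import math

def hausdorff(front_a, front_b):
    """Symmetric Hausdorff distance between two point sets, e.g. the
    predicted vs. true Pareto fronts in (accuracy, energy) space."""
    def directed(src, dst):
        # for each point in src, distance to its nearest neighbor in dst;
        # take the worst (largest) of those nearest-neighbor distances
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(front_a, front_b), directed(front_b, front_a))
```

A value near zero means every predicted point lies close to some true point and vice versa, which is exactly the alignment the metric is meant to capture.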
However, while these metrics effectively measure the distance between Pareto fronts and the consistency of rankings, they do not directly account for the alignment between ranked Pareto-optimal solutions. To address this limitation, we introduce Set-Based Order Value Alignment at k (SOVA@k), a ranking-alignment metric specifically designed for multi-objective evaluation. Unlike traditional metrics, SOVA@k allows the comparison of two sets (such as a true Pareto front and a predicted Pareto front) that may contain different items, provided both sets have the same number of ranked elements $k$. When ties occur, multiple items may share the same rank position, leading to sets of different lengths; SOVA@k can be extended to handle this case by appropriately adjusting the ranking positions, as detailed in Section B.1. Notably, the metric evaluates the alignment between two ranked sets not solely on the order of items, but on the true values of the relevant metrics. Specifically, it quantifies ranking alignment by computing the weighted sum of absolute differences in true objective values, applying rank-based weighting to prioritize higher-ranked positions and user-defined objective weighting to emphasize the relative importance of different objectives:
Definition 5.1.
Given a set of items $I$, where each item is characterized by $m$ objectives, and two ranked Pareto frontiers $X=(x_1,\dots,x_k)$ and $Y=(y_1,\dots,y_k)$, where $x_i, y_i \in I$ and $k\in\mathbb{N}^{+}$, let $w_i$ be the position weights and $\tau_j^{\prime}$ the normalized objective weights. The Set-Based Order Value Alignment (SOVA) at $k$ is defined as:

$$\mathrm{SOVA(X,Y)@k}=\sum_{i=1}^{k} w_i \cdot \sum_{j=1}^{m} \tau_j^{\prime} \cdot |x_{ij}-y_{ij}|,$$

where $k$ is the number of top-ranked elements, $m$ is the number of objectives, and $x_{ij}$, $y_{ij}$ denote the normalized true values of the $j$-th objective at rank $i$ in the true and predicted sets, respectively. The two sets are ranked independently before being passed to the function; $w_i$ is the rank-based weight for position $i$, computed with exponential decay, while the user-defined objective weight $\tau_j$ is normalized:

$$w_i=\frac{e^{-\lambda i}}{\sum_{l=1}^{k} e^{-\lambda l}}, \quad \tau_j^{\prime}=\frac{\tau_j}{\sum_{l=1}^{m}\tau_l},$$

where $\lambda>0$ controls the decay rate, ensuring $\sum_{i=1}^{k} w_i = 1$.
SOVA@k ranges in $[0,1]$, where a value of 0 indicates perfect alignment between the two sets and a value of 1 signifies complete dissimilarity. Proofs of boundedness, additional details of this metric, and a comparison with other metrics can be found in Appendix B.
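Definition 5.1 translates directly into code; the sketch below (names are ours) assumes both sets are already independently ranked and that objective values are normalized to $[0,1]$:

```python
import math

def sova_at_k(X, Y, taus, lam=1.0):
    """SOVA@k between two ranked sets X, Y of equal length k, where each
    entry is a tuple of m normalized objective values. taus are the
    user-defined objective weights (normalized internally); lam > 0 is
    the exponential decay rate of the rank-based position weights."""
    k, m = len(X), len(taus)
    Z = sum(math.exp(-lam * (i + 1)) for i in range(k))
    w = [math.exp(-lam * (i + 1)) / Z for i in range(k)]  # sums to 1
    t = [tau / sum(taus) for tau in taus]                 # sums to 1
    return sum(
        w[i] * sum(t[j] * abs(X[i][j] - Y[i][j]) for j in range(m))
        for i in range(k)
    )
```

Identical ranked sets give 0, while maximally distant normalized values at every rank give 1, matching the boundedness claim above.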
| Dataset | MAE$_{\text{A}}^{0}$ (↓) | MAE$_{\text{E}}^{0}$ (↓) | MAE$_{\text{A}}^{30}$ (↓) | MAE$_{\text{E}}^{30}$ (↓) | MAE$_{\text{A}}^{70}$ (↓) | MAE$_{\text{E}}^{70}$ (↓) |
|---|---|---|---|---|---|---|
| CIFAR-10 | 0.116 ± 0.002 | 0.014 ± 0.001 | 0.122 ± 0.003 | 0.010 ± 0.000 | 0.163 ± 0.010 | 0.012 ± 0.004 |
| FS-TKY | 0.025 ± 0.001 | 0.034 ± 0.001 | 0.024 ± 0.002 | 0.029 ± 0.001 | 0.021 ± 0.002 | 0.030 ± 0.001 |
| Rotten_tomatoes | 0.123 ± 0.022 | 0.014 ± 0.000 | 0.140 ± 0.018 | 0.019 ± 0.003 | 0.128 ± 0.024 | 0.035 ± 0.003 |
Table 2: Mean ± Std of the MAE between the values predicted by the GREEN predictor model (A for validation accuracy, E for energy) and the corresponding ground-truth values. The superscripts 0, 30, and 70 indicate the percentage of samples discarded in each experiment.
| Dataset | NDCG (↑) |
|---|---|
| CIFAR-10 | 0.985 ± 0.003 |
| FS-TKY | 0.989 ± 0.002 |
| Rotten_tomatoes | 0.960 ± 0.011 |
Table 3: NDCG on predicted Pareto fronts across all runs and weight configurations $[\omega_A, \omega_E]$, where $\omega_A+\omega_E=1$.
| Dataset | HaD (↓) | $\Delta HV$ (↓) |
|---|---|---|
| CIFAR-10 | 0.050 ± 0.011 | 0.009 ± 0.006 |
| FS-TKY | 0.075 ± 0.002 | 0.049 ± 0.030 |
| Rotten_tomatoes | 0.312 ± 0.021 | 0.195 ± 0.060 |
Table 4: Hausdorff Distance (HaD) and Hypervolume Difference ($\Delta HV$) on the 3 test sets.
6 Results
Figure 2: Mean and standard deviation of SOVA@k across test datasets at varying $\omega_A$, the weight assigned to the validation-accuracy target relative to the energy target ($\omega_A+\omega_E=1$).
| Dataset | Recall$_{\text{EE}}$ (↑) | Recall$_{\text{RE}}$ (↑) | Recall$_{\text{IE}}$ (↑) | F1$_{\text{EE}}$ (↑) | F1$_{\text{RE}}$ (↑) | F1$_{\text{IE}}$ (↑) |
|---|---|---|---|---|---|---|
| CIFAR-10 | 0.367 ± 0.462 | 0.574 ± 0.302 | 0.971 ± 0.025 | 0.008 ± 0.011 | 0.059 ± 0.081 | 0.186 ± 0.106 |
| FS-TKY | 0.000 ± 0.000 | 0.023 ± 0.040 | 0.995 ± 0.008 | 0.000 ± 0.000 | 0.002 ± 0.003 | 0.388 ± 0.236 |
| Rotten_tomatoes | 0.467 ± 0.115 | 0.894 ± 0.094 | 0.917 ± 0.144 | 0.142 ± 0.028 | 0.692 ± 0.068 | 0.532 ± 0.075 |
Table 5: Mean ± Std of performance metrics for evaluating GREEN on the test datasets. Recall and F1 are reported under three scenarios: Exact Epoch (EE), Relaxed Epoch (RE, ±5 epochs), and Ignored Epoch (IE, no epoch constraint).
Prediction Accuracy. Table 2 reports the performance of our predictive model: the mean and standard deviation of the MAE achieved across three independent runs on the test datasets, with different percentages of samples discarded. We observe that the accuracy predictions are more sensitive to increasing discard rates than the energy predictions, particularly for CIFAR-10 and Rotten_tomatoes. This suggests that as the complexity of the task increases due to modifications of the dataset, the learning behavior of the model configurations across training epochs becomes less predictable. The relatively lower predictive performance on the Rotten_tomatoes dataset is likely due to the specific nature of the model architectures considered in the NLP domain, particularly their dimensionality, along with the smaller number of tracked training dynamics and the shorter sequence of training epochs for the NLP experiments used to train the predictor model, compared to the experiments in the other domains tracked in EcoTaskSet (see Table 1). Further sanity checks assessing the robustness of our predictive pipeline are provided in Appendix E, while disaggregated results can be found in Appendix F.
Robustness and Effectiveness of Ranking. Given that our ultimate goal is to recommend Pareto-optimal combinations of model architecture, batch size, learning rate, and number of training epochs that closely align with the true Pareto front, ensuring consistent and reliable rankings of model configurations based on user preferences is more critical than achieving perfect accuracy in predicting the target metrics. In fact, even with minor prediction errors, as long as these errors are systematically consistent with the true values, their overall impact remains minimal. This is evidenced by the high average NDCG scores reported in Table 3, which demonstrate strong alignment between the predicted and actual rankings across the different weight configurations that modulate the relative importance of the two objectives in the ranking score. The weight configurations span combinations of $\omega_A$ and $\omega_E$ with values ranging between 0 and 1 in steps of 0.1. These results highlight the robustness of GREEN in selecting and ranking model configurations for a given problem setting, even when preferences vary regarding the prioritization of objectives.
Pareto Front Alignment. Table 5, instead, provides quantitative insights into the quality of the predicted Pareto front by reporting Recall and F1-scores under three evaluation scenarios: the Exact Epoch (EE) scenario assesses the alignment between the true and predicted Pareto fronts by considering both the model configuration and the exact number of training epochs; the Relaxed Epoch (RE) scenario allows minor deviations in epoch selection (±5 epochs) while still evaluating the compatibility of the Pareto fronts; and the Ignored Epoch (IE) scenario evaluates the overlap of items in the Pareto fronts without considering the suggested number of training epochs. The results indicate that under the EE setting, where both the model configuration and the exact training epoch must be correctly predicted, the Recall and F1 scores are generally lower, particularly for the FS-TKY dataset, which shows minimal overlap due to its wider epoch space (400 epochs) compared to the CV and NLP domains, whose epoch sequences have length 100 and 5, respectively. It should be emphasized, however, that this level of granularity in recommendations, particularly the precise alignment of model configurations and training epochs, is rarely addressed in NAS or model selection approaches. Significant improvements in Pareto matching performance are observed when transitioning to the RE setting and, even more so, to the IE setting, demonstrating the effectiveness of GREEN in identifying configurations that are truly Pareto-optimal. Table 4 reports the Hausdorff Distance (HaD) and Hypervolume Difference ($\Delta HV$) across the three test datasets, offering complementary views on the spatial deviation and coverage between the predicted and true Pareto fronts.
In general, both metrics yield values close to zero, indicating strong alignment and effective approximation of the true fronts. The slightly higher HaD and $\Delta HV$ values observed for the Rotten_tomatoes dataset are consistent with the less accurate underlying predictions for the objective metrics in that setting. Overall, these results highlight the effectiveness of GREEN in accurately capturing the structure and composition of the Pareto front across various datasets and problem settings, while also underscoring the sensitivity of front reconstruction to the quality of the target prediction task. Due to space constraints, a visual comparison of the predicted and true Pareto frontiers is presented in Appendix E as complementary material.
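The two front-quality metrics can be sketched as follows for two objectives; this is an illustrative implementation, assuming both objectives are cast as minimization problems and fronts are given as point arrays, with $\Delta HV$ taken as the absolute difference of the two hypervolumes.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets (n x d)."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def hypervolume_2d(front, ref):
    """2D hypervolume (both objectives minimized) w.r.t. a reference point.

    `front` is assumed nondominated; the dominated area is summed as
    rectangles after sorting by the first objective (standard sweep).
    """
    pts = np.asarray(front, float)
    pts = pts[np.argsort(pts[:, 0])]
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        hv += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return hv

def delta_hv(true_front, pred_front, ref):
    return abs(hypervolume_2d(true_front, ref) - hypervolume_2d(pred_front, ref))
```

HaD captures the worst spatial deviation between the fronts, while $\Delta HV$ captures differences in overall coverage, which is why the two are complementary.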
Consistency of Ranked Pareto-optimal Solutions. The SOVA@k results, presented in Fig. 2, provide an in-depth analysis of GREEN’s ability to maintain ranking consistency between the predicted and true Pareto fronts across different datasets and varying objective weights (represented as $\omega_A$ on the x-axis). For CIFAR-10, the SOVA@1, SOVA@5, and SOVA@10 scores remain relatively low and stable across most values of $\omega_A$, demonstrating consistent alignment of rankings. However, a slight increase in SOVA@k values is observed as $\omega_A$ approaches 1, indicating minor performance degradation when the priority shifts solely toward maximizing performance, regardless of energy consumption. The FS-TKY dataset displays consistently low and stable SOVA@k scores across all settings, suggesting that GREEN effectively preserves ranking consistency in this domain, even under diverse weight configurations. In contrast, the Rotten_tomatoes dataset reveals an upward trend in SOVA@k scores, particularly SOVA@10, as $\omega_A$ increases. This degradation is closely tied to the comparatively higher MAE in validation accuracy for Rotten_tomatoes (reported in Table 2), where even small prediction errors can lead to substantial ranking misalignments (recall that SOVA@k computes a weighted sum of absolute differences in the true objective values of matched configurations).
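Following the definition given in Appendix B, SOVA@k can be sketched as a weighted sum of absolute differences over the matched top-k configurations; the rank weights $w$ and objective weights $\tau'$ are assumed to each sum to 1, as in the boundedness proofs, so the score stays in $[0,1]$ (0 means identical rankings).

```python
import numpy as np

def sova_at_k(X, Y, w, tau):
    """SOVA@k = sum_i w_i * sum_j tau'_j * |x_ij - y_ij|.

    X, Y: (k x m) objective values of the top-k configurations in the
    true and predicted rankings, normalized to [0, 1].
    w: rank weights (length k, sum to 1); tau: objective weights
    (length m, sum to 1).
    """
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    w, tau = np.asarray(w, float), np.asarray(tau, float)
    return float(np.sum(w[:, None] * tau[None, :] * np.abs(X - Y)))
```

Because the score depends on true objective values rather than rank positions alone, even small prediction errors that reorder configurations with very different objective values produce a visibly larger SOVA@k, matching the Rotten_tomatoes behavior discussed above.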
Comparison with Competitors. To the best of our knowledge, no existing method directly addresses the goal of energy-aware, cross-domain model selection over standard architectures. To contextualize our results, however, we compare our approach with two representative Eco-NAS baselines: EC-NAS and KNAS. Notably, these methods are specifically designed for Eco-NAS within constrained architecture spaces composed of closely related models, whereas GREEN targets a broader scenario, aiming to generalize across datasets and architecture families. Despite considerable effort, it was not possible to adapt the codebases of these baselines to our cross-domain search space. For this reason, we provide both an illustrative and a more NAS-specific quantitative comparison by presenting their results on the NAS benchmarks originally used in their respective publications—NASBench-101 for EC-NAS and NASBench-201 for KNAS. As part of the illustrative comparison, Table 7 reports the predicted performance of the Pareto-optimal configurations suggested by each method—both when the objective is to maximize performance (_MA) and when equal importance is given to performance and energy consumption (_B), along with the runtime required to produce each solution. As shown in the table, GREEN consistently recommends configurations that achieve strong trade-offs between validation accuracy and energy usage. Moreover, a notable advantage of GREEN lies in its efficiency: although training the predictive model incurs a one-time computational cost, inference is extremely fast. In contrast, the Eco-NAS baselines require re-running the full optimization or search process for every new dataset or constraint, resulting in significantly higher computational overhead. In the second comparative setting, we evaluate the behavior of GREEN within a NAS-specific benchmark. 
Specifically, we assess its performance on NASBench-101, enabling a fair comparison with EC-NAS and showcasing its capacity to generalize beyond its original cross-domain design. As shown in Table 7, although GREEN was not originally designed for NAS tasks, it nonetheless demonstrates the ability to operate effectively within such constrained settings. Enhancing the performance of GREEN in NAS-specific contexts—through the development of refined feature representations that better capture the subtle architectural distinctions characteristic of NASBench-style benchmarks—is left to future work.
| Method | Predicted A (acc) | Predicted E (kWh) | Time (s) |
|---|---|---|---|
| EC-NAS_MA | 0.822 (-0.148) | 27.745 (27.673) | 564 |
| EC-NAS_B | 0.771 (-0.191) | 8.827 (8.814) | 564 |
| KNAS | 0.183 (-0.787) | 0.526 (0.454) | 27,960 |
| GREEN_MA (ours) | **0.899** (-0.071) | 0.509 (0.437) | 1,241+**12** |
| GREEN_B (ours) | 0.887 (-0.075) | **0.086** (0.073) | 1,241+**12** |
Table 6: Comparison of GREEN vs. competitors in accuracy (A), energy (E), and computational time (s). Values in brackets show the gap between the suggested configurations and the best ground truth in EcoTaskSet. The best result in each column is highlighted in bold. The bold value after the + for GREEN is the inference time, as training occurs only once.
| Method | Validation Accuracy | Training Time (s) |
|---|---|---|
| EC-NAS | **0.946** (-0.5%) | 3160 (-34.1%) |
| GREEN (ours) | 0.917 (-3.5%) | **1628** (-66.0%) |
Table 7: Comparison of GREEN and EC-NAS in terms of ground-truth validation accuracy and training time (in seconds), as reported in NASBench-101. Each solution corresponds to the predicted Pareto-optimal configuration maximizing validation accuracy at epoch 108. The values shown in bold represent the best solution for each individual objective.
7 Conclusions and Future Work
This work addresses the critical challenge of environmental sustainability in AI development by introducing a novel approach to eco-efficient model selection and optimization. Our method offers a flexible, domain-agnostic solution for recommending Pareto-optimal NN configurations that balance performance and energy consumption. Operating at inference time, our approach overcomes the limitations of traditional NAS and HPO, demonstrating effectiveness across diverse AI domains.
The release of EcoTaskSet provides researchers and practitioners with valuable resources to advance eco-efficient machine learning. We hope that our work contributes to a more sustainable future by enabling informed decisions that consider performance and energy efficiency.
Future work aims to develop a framework that automatically updates the knowledge base with new experiments, enabling EcoTaskSet to expand to various tasks without manual intervention.
References
- Bakhtiarifard et al., (2024) Bakhtiarifard, P., Igel, C., and Selvan, R. (2024). Ec-nas: Energy consumption aware tabular benchmarks for neural architecture search. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5660–5664. IEEE.
- Bender et al., (2021) Bender, E.M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY, USA. Association for Computing Machinery.
- Betello et al., (2024) Betello, F., Purificato, A., Siciliano, F., Trappolini, G., Bacciu, A., Tonellotto, N., and Silvestri, F. (2024). A reproducible analysis of sequential recommender systems. IEEE Access.
- Bossard et al., (2014) Bossard, L., Guillaumin, M., and Van Gool, L. (2014). Food-101–mining discriminative components with random forests. In Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part VI 13, pages 446–461. Springer.
- Chung et al., (2024) Chung, J.-W., Gu, Y., Jang, I., Meng, L., Bansal, N., and Chowdhury, M. (2024). Reducing energy bloat in large model training. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 144–159.
- Courty et al., (2023) Courty, B., Schmidt, V., Goyal-Kamal, Coutarel, M., Feld, B., Lecourt, J., SabAsmine, kngoyal, Léval, M., Cruveiller, A., inimaz, ouminasara, Zhao, F., Joshi, A., Bogroff, A., Saboni, A., de Lavoreille, H., Laskaris, N., Blanche, L., Abati, E., LiamConnell, Blank, D., Wang, Z., Catovic, A., Stkechly, M., alencon, JPW, MinervaBooks, Çarkacı, N., and DomAlexRod (2023). mlco2/codecarbon: v2.3.2.
- Dale and Chall, (1948) Dale, E. and Chall, J.S. (1948). A formula for predicting readability: Instructions. Educational research bulletin, pages 37–54.
- DeepSeek-AI, (2024) DeepSeek-AI (2024). Deepseek-v3 technical report.
- Devlin et al., (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T., editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Dong and Yang, (2020) Dong, X. and Yang, Y. (2020). Nas-bench-201: Extending the scope of reproducible neural architecture search. arXiv preprint arXiv:2001.00326.
- Dosovitskiy, (2020) Dosovitskiy, A. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Dou et al., (2023) Dou, S., Jiang, X., Zhao, C.R., and Li, D. (2023). Ea-has-bench: Energy-aware hyperparameter and architecture search benchmark. In The Eleventh International Conference on Learning Representations.
- Elsken et al., (2018) Elsken, T., Metzen, J.H., and Hutter, F. (2018). Efficient multi-objective neural architecture search via lamarckian evolution. arXiv preprint arXiv:1804.09081.
- Faiz et al., (2024) Faiz, A., Kaneda, S., Wang, R., Osi, R., Sharma, P., Chen, F., and Jiang, L. (2024). Llmcarbon: Modeling the end-to-end carbon footprint of large language models. In The Twelfth International Conference on Learning Representations. ICLR.
- George et al., (2023) George, A.S., George, A.H., and Martin, A.G. (2023). The environmental impact of ai: a case study of water consumption by chat gpt. Partners Universal International Innovation Journal, 1(2):97–104.
- Guo et al., (2020) Guo, Y., Chen, Y., Zheng, Y., Zhao, P., Chen, J., Huang, J., and Tan, M. (2020). Breaking the curse of space explosion: Towards efficient nas with curriculum search. In International Conference on Machine Learning, pages 3822–3831. PMLR.
- Harper and Konstan, (2015) Harper, F.M. and Konstan, J.A. (2015). The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4).
- He et al., (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
- Henrikson, (1999) Henrikson, J. (1999). Completeness and total boundedness of the hausdorff metric. MIT Undergraduate Journal of Mathematics, 1(69-80):10.
- Hidasi et al., (2016) Hidasi, B., Karatzoglou, A., Baltrunas, L., and Tikk, D. (2016). Session-based recommendations with recurrent neural networks.
- Hou et al., (2022) Hou, Y., Hu, B., Zhang, Z., and Zhao, W.X. (2022). Core: simple and effective session-based recommendation within consistent representation space.
- Iandola, (2016) Iandola, F.N. (2016). Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360.
- Javaheripi et al., (2023) Javaheripi, M., Bubeck, S., Abdin, M., Aneja, J., Bubeck, S., Mendes, C. C.T., Chen, W., Del Giorno, A., Eldan, R., Gopi, S., et al. (2023). Phi-2: The surprising power of small language models. Microsoft Research Blog.
- Jiang et al., (2023) Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., Casas, D. d.l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7b. arXiv preprint arXiv:2310.06825.
- Kang and McAuley, (2018) Kang, W.-C. and McAuley, J. (2018). Self-attentive sequential recommendation.
- Kincaid, (1975) Kincaid, J. (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Chief of Naval Technical Training.
- Krizhevsky et al., (2009) Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.
- Krizhevsky et al., (2012) Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
- LeCun et al., (2010) LeCun, Y., Cortes, C., and Burges, C. (2010). Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2.
- Li and Yao, (2019) Li, M. and Yao, X. (2019). Quality evaluation of solution sets in multiobjective optimisation: A survey. ACM Computing Surveys (CSUR), 52(2):1–38.
- Liu et al., (2018) Liu, H., Simonyan, K., and Yang, Y. (2018). Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055.
- Liu et al., (2022) Liu, S., Zhang, H., and Jin, Y. (2022). A survey on computationally efficient neural architecture search. Journal of Automation and Intelligence, 1(1):100002.
- Liu, (2019) Liu, Y. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 364.
- Metz et al., (2020) Metz, L., Maheswaranathan, N., Sun, R., Freeman, C.D., Poole, B., and Sohl-Dickstein, J. (2020). Using a thousand optimization tasks to learn hyperparameter search strategies. arXiv preprint arXiv:2002.11887.
- Pang and Lee, (2005) Pang, B. and Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL.
- Pareto, (1964) Pareto, V. (1964). Cours d’économie politique, volume 1. Librairie Droz.
- Saravia et al., (2018) Saravia, E., Liu, H.-C.T., Huang, Y.-H., Wu, J., and Chen, Y.-S. (2018). CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium. Association for Computational Linguistics.
- Schutze et al., (2012) Schutze, O., Esquivel, X., Lara, A., and Coello, C. A.C. (2012). Using the averaged hausdorff distance as a performance measure in evolutionary multiobjective optimization. IEEE Transactions on Evolutionary Computation, 16(4):504–522.
- Siems et al., (2020) Siems, J., Zimmer, L., Zela, A., Lukasik, J., Keuper, M., and Hutter, F. (2020). Nas-bench-301 and the case for surrogate benchmarks for neural architecture search. arXiv preprint arXiv:2008.09777, 4:14.
- Simonyan and Zisserman, (2014) Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Strubell et al., (2020) Strubell, E., Ganesh, A., and McCallum, A. (2020). Energy and policy considerations for modern deep learning research. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13693–13696.
- Sun et al., (2019) Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., and Jiang, P. (2019). Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer.
- Tan and Le, (2019) Tan, M. and Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR.
- Touvron et al., (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C.C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P.S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X.E., Tang, B., Taylor, R., Williams, A., Kuan, J.X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models.
- Vente et al., (2024) Vente, T., Wegmeth, L., Said, A., and Beel, J. (2024). From clicks to carbon: The environmental toll of recommender systems. In Proceedings of the 18th ACM Conference on Recommender Systems, RecSys ’24, page 580–590, New York, NY, USA. Association for Computing Machinery.
- Wang et al., (2013) Wang, Y., Wang, L., Li, Y., He, D., and Liu, T.-Y. (2013). A theoretical analysis of ndcg type ranking measures. In Conference on learning theory, pages 25–54. PMLR.
- Wang et al., (2020) Wang, Y., Wang, Q., Shi, S., He, X., Tang, Z., Zhao, K., and Chu, X. (2020). Benchmarking the performance and energy efficiency of ai accelerators for ai training. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pages 744–751. IEEE.
- Wu et al., (2022) Wu, C.-J., Raghavendra, R., Gupta, U., Acun, B., Ardalani, N., Maeng, K., Chang, G., Aga, F., Huang, J., Bai, C., et al. (2022). Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems, 4:795–813.
- Xiao et al., (2017) Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
- Xu et al., (2021) Xu, J., Zhao, L., Lin, J., Gao, R., Sun, X., and Yang, H. (2021). Knas: green neural architecture search. In International Conference on Machine Learning, pages 11613–11625. PMLR.
- Yang et al., (2014) Yang, D., Zhang, D., Zheng, V.W., and Yu, Z. (2014). Modeling user activity preference by leveraging user spatial temporal characteristics in lbsns. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(1):129–142.
- Ying et al., (2019) Ying, C., Klein, A., Christiansen, E., Real, E., Murphy, K., and Hutter, F. (2019). Nas-bench-101: Towards reproducible neural architecture search. In International conference on machine learning, pages 7105–7114. PMLR.
- You et al., (2023) You, J., Chung, J.-W., and Chowdhury, M. (2023). Zeus: Understanding and optimizing GPU energy consumption of DNN training. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 119–139.
- Zela et al., (2020) Zela, A., Siems, J., Zimmer, L., Lukasik, J., Keuper, M., and Hutter, F. (2020). Surrogate nas benchmarks: Going beyond the limited search spaces of tabular nas benchmarks. arXiv preprint arXiv:2008.09777.
- Zhao et al., (2024) Zhao, Y., Liu, Y., Jiang, B., and Guo, T. (2024). CE-NAS: An end-to-end carbon-efficient neural architecture search framework. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- Zimmer et al., (2021) Zimmer, L., Lindauer, M., and Hutter, F. (2021). Auto-pytorch tabular: Multi-fidelity metalearning for efficient and robust autodl. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(9):3079 – 3090.
- Zitzler and Thiele, (1998) Zitzler, E. and Thiele, L. (1998). Multiobjective optimization using evolutionary algorithms—a comparative case study. In International conference on parallel problem solving from nature, pages 292–301. Springer.
Appendix A Technical Appendices and Supplementary Material
A.1 Overview of Datasets and Models
Vision Models
- •AlexNet (Krizhevsky et al., 2012): One of the first CNNs, known for its 8-layer architecture, which performed well in large-scale image classification.
- •EfficientNet (Tan and Le, 2019): A family of CNNs that balances accuracy and efficiency by systematically scaling width, depth, and resolution.
- •ResNet18 (He et al., 2016): An 18-layer lightweight residual NN using skip connections to mitigate the vanishing gradient problem.
- •SqueezeNet (Iandola, 2016): An ultra-lightweight convolutional NN designed for model size efficiency, with fire modules for parameter reduction.
- •ViT (Dosovitskiy, 2020): A transformer-based architecture that applies self-attention to image patches for strong image recognition performance.
- •VGG16 (Simonyan and Zisserman, 2014): A deep convolutional NN with 16 layers, known for its simplicity and uniform use of 3×3 convolutional filters.
Vision Datasets
- •CIFAR-10 (Krizhevsky et al., 2009): It consists of 60,000 32×32 color images divided into 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
- •FOOD101 (Bossard et al., 2014): It comprises 101 food categories with 750 training and 250 test images per category, for a total of 101K images.
- •MNIST (LeCun et al., 2010): A large collection of handwritten digits, with a training set of 60,000 examples and a test set of 10,000 examples.
- •Fashion-MNIST (Xiao et al., 2017): It consists of 28×28 greyscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images.
Text Models
- •RoBERTa (Liu, 2019): An optimized version of BERT by Facebook that improves performance through larger datasets and longer training.
- •BERT (Devlin et al., 2019): A groundbreaking transformer-based model by Google that uses bidirectional attention to understand the context of words in a sentence.
- •Microsoft-PHI-2 (Javaheripi et al., 2023): A small language model by Microsoft: a Transformer with 2.7 billion parameters.
- •Mistral-7B-v0.3 (Jiang et al., 2023): A highly efficient, open-weight, 7-billion-parameter language model offering strong performance in text generation and understanding tasks.
Text Datasets
- •Google-boolq (https://huggingface.co/datasets/google/boolq): A question-answering dataset of yes/no questions containing 15,942 examples. Each example is a (question, passage, answer) triplet.
- •StanfordNLP-IMDB (https://huggingface.co/datasets/stanfordnlp/imdb): A dataset for binary sentiment classification, providing 25,000 highly polar movie reviews for training and 25,000 for testing.
- •Dair-ai/Emotions (https://huggingface.co/datasets/dair-ai/emotion) (Saravia et al., 2018): A dataset of English Twitter messages labeled with six basic emotions: anger, fear, joy, love, sadness, and surprise.
- •Rotten_tomatoes (https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes) (Pang and Lee, 2005): A dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews.
Recommendation Models
- •BERT4Rec (Sun et al., 2019): This model is based on the BERT architecture, enabling it to capture complex relationships in user behaviour sequences through bidirectional self-attention.
- •CORE (Hou et al., 2022): It introduces an attention mechanism that enables the model to weigh the contribution of each item in the input sequence, enhancing recommendation accuracy.
- •GRU4Rec (Hidasi et al., 2016): This model utilizes GRUs to capture temporal dependencies in user-item interactions.
- •SASRec (Kang and McAuley, 2018): This model is characterized by its use of self-attention mechanisms, allowing it to discern the relevance of each item within the user’s sequence.
Recommendation Datasets
- •Foursquare (https://sites.google.com/site/yangdingqi/home/foursquare-dataset): These datasets contain check-ins collected over a period of approximately ten months (Yang et al., 2014). We use the New York City (FS-NYC) and Tokyo (FS-TKY) versions.
- •MovieLens (https://grouplens.org/datasets/movielens): The MovieLens dataset (Harper and Konstan, 2015) is widely recognized as a benchmark for evaluating recommendation algorithms. We utilize two versions: MovieLens 1M (ML-1M) and MovieLens 100k (ML-100k).
Our pre-processing approach adheres to common practice: ratings are treated as implicit feedback, meaning all interactions are utilized regardless of their rating values, and users or items with fewer than five interactions are excluded (Kang and McAuley, 2018; Sun et al., 2019). For evaluation, following (Sun et al., 2019; Kang and McAuley, 2018), the final interaction of each user is used for testing, the second-to-last interaction is used for validation, and all other interactions form the training set.
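The leave-one-out split described above can be sketched as follows; the data layout and the handling of very short histories are simplifying assumptions (the paper filters users with fewer than five interactions beforehand).

```python
from collections import defaultdict

def leave_one_out_split(interactions):
    """Per-user leave-one-out split: last interaction -> test,
    second-to-last -> validation, rest -> train.

    `interactions` is a list of (user, item) pairs in chronological
    order. Users with fewer than 3 events stay entirely in train here,
    a simplification since the paper already filters such users.
    """
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)

    train, val, test = {}, {}, {}
    for user, items in by_user.items():
        if len(items) < 3:
            train[user] = items
            continue
        train[user] = items[:-2]
        val[user] = items[-2]
        test[user] = items[-1]
    return train, val, test
```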
A.2 Knowledge Base Creation
All our experiments were performed with five different batch sizes (32, 64, 128, 256, 512) and three different learning rates ($10^{-3}$, $10^{-4}$, $10^{-5}$), values commonly used in the literature. Lastly, to study how the size of the dataset, and consequently the complexity of the task, influences energy consumption and test performance, we removed different percentages of samples from the data, discarding the same percentage from each class. In particular, we first ran our experiments on the entire dataset and then with 30% and 70% of the samples removed.
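The resulting search grid per model/dataset pair can be enumerated directly (values taken from the text; the variable names are illustrative):

```python
from itertools import product

# Hyperparameter grid described in A.2.
batch_sizes = [32, 64, 128, 256, 512]
learning_rates = [1e-3, 1e-4, 1e-5]
dataset_fractions = [1.0, 0.7, 0.3]  # 0%, 30%, 70% of samples removed

# 5 batch sizes x 3 learning rates x 3 fractions = 45 configurations
grid = list(product(batch_sizes, learning_rates, dataset_fractions))
```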
Considering all the datasets used for the experiments, we have a total of 1,767 experiments, divided into:
- •989 computer vision experiments, of which 252 configurations are used for testing;
- •637 recommendation systems experiments, of which 180 configurations are used for testing;
- •141 natural language processing experiments, of which 36 configurations are used for testing.
A.3 Feature Extraction
All the features available in EcoTaskSet can be found in Table 12. Some are shared across all tasks, while others are task-specific. The task features are extracted using Python code computing data statistics. The infrastructural features are extracted using the CodeCarbon library (https://codecarbon.io), which exposes hardware-specific information. The FLOPs and the number of parameters of the models are extracted using the DeepSpeed library (https://www.deepspeed.ai/). All the other model features are extracted from the information available through the PyTorch library (https://pytorch.org/), except for the LoRA rank in attention layers, which is extracted using the HuggingFace library (https://huggingface.co/) and Python code, as are the mean sequence length, maximum sequence length, mean Flesch–Kincaid Grade level (Kincaid, 1975), and mean Dale–Chall Readability score (Dale and Chall, 1948). All the recommendation features are extracted using the EasyRec library (Betello et al., 2024).
In order to deal only with numerical features, the few textual features (e.g., the type of activation function) are binarized. For samples of different lengths, we use padding: one model may have 6 batch normalization layers, each with its own characteristics, while another has 10. We use a padding value that our network can recognize, and the padding length equals the length of the longest list.
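A minimal sketch of this padding scheme; the sentinel value of −1.0 is an assumption, as the actual padding value used by the authors is not specified.

```python
# Right-pad variable-length per-layer feature lists to the longest list
# with a sentinel value the downstream predictor can recognize.
PAD = -1.0  # assumed sentinel; the paper does not state the actual value

def pad_feature_lists(lists):
    """Pad each feature list to the length of the longest one."""
    max_len = max(len(l) for l in lists)
    return [l + [PAD] * (max_len - len(l)) for l in lists]
```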
| Metric | Similarities | Differences |
|---|---|---|
| NDCG | Uses rank-based weighting (higher-ranked items matter more). | NDCG is relevance-based and does not consider multiple objectives or absolute differences in values. It evaluates a single ranking. |
| Kendall’s Tau | Measures ranking consistency between two sets. | SOVA@k incorporates true values in ranking, rather than just rank positions. |
| Spearman’s Rank Correlation | Measures monotonic relationships between rankings. | Spearman’s method is a purely ordinal measure and does not use value-based distance like SOVA@k. |
| Hausdorff Distance | Measures the largest distance between points in two sets. | Hausdorff applies in geometric spaces, while SOVA@k operates on ranked sets of multi-objective scores. |
| ΔHV | Compares two Pareto fronts based on dominated space. | ΔHV focuses on set coverage, while SOVA@k compares rankings at a fixed k. |
| Borda Count | Uses weighted scores for decision-making across multiple criteria. | SOVA@k does not aggregate rankings but measures distance from an ideal ranking. |
Table 8: Comparison of SOVA@k with Existing Metrics
Appendix B Description of Set-Based Order Value Alignment (SOVA) metric
Boundedness of SOVA@k
To ensure that the Set-Based Order Value Alignment at k (SOVA@k) is well-defined and interpretable, we prove that it is always bounded within the interval $[0,1]$.
Lemma B.1.
$\min_{X,Y}\mathrm{SOVA}(X,Y)@k = 0$
Proof.
The SOVA@k metric is a sum of non-negative terms:

$$\mathrm{SOVA}(X,Y)@k=\sum_{i=1}^{k} w_i \cdot \sum_{j=1}^{m} \tau_j^{\prime} \cdot |x_{i,j}-y_{i,j}|,$$

where $w_i, \tau_j^{\prime} \geq 0$ (by construction) and $|x_{i,j}-y_{i,j}| \geq 0$ (since absolute differences are non-negative). Thus, $\mathrm{SOVA}(X,Y)@k \geq 0$.

To show attainability, let $X=Y$. Then $x_{i,j}=y_{i,j}$ for all $i,j$, so $|x_{i,j}-y_{i,j}|=0$. Substituting into SOVA@k:

$$\mathrm{SOVA}(X,X)@k=\sum_{i=1}^{k} w_i \cdot \sum_{j=1}^{m} \tau_j^{\prime} \cdot 0 = 0.$$

Therefore, the minimum value of SOVA@k is achievable and equals 0. ∎
Lemma B.2.
$\max_{X,Y}\mathrm{SOVA}(X,Y)@k = 1$
Proof.
Since objectives are normalized to $[0,1]$, we have $|x_{i,j}-y_{i,j}|\leq 1$ for all $i,j$. Substituting into the metric:
$$\mathrm{SOVA}(X,Y)@k=\sum_{i=1}^{k} w_i \sum_{j=1}^{m} \tau_j^{\prime} \cdot |x_{i,j}-y_{i,j}| \leq \sum_{i=1}^{k} w_i \sum_{j=1}^{m} \tau_j^{\prime} \cdot 1 = \left(\sum_{i=1}^{k} w_i\right)\left(\sum_{j=1}^{m} \tau_j^{\prime}\right) = 1.$$
To show attainability, suppose there exist Pareto frontiers $X$ and $Y$ where $x_{i,j}=1$ and $y_{i,j}=0$ for all $i,j$. This satisfies $|x_{i,j}-y_{i,j}|=1$. Substituting into SOVA@k:
$$\mathrm{SOVA}(X,Y)@k=\sum_{i=1}^{k} w_i \sum_{j=1}^{m} \tau_j^{\prime} \cdot 1 = 1.$$
Thus, the maximum value of SOVA@k is 1. ∎
In this section, we have proved that SOVA@k is mathematically bounded between 0 and 1, ensuring that it remains a well-defined and interpretable ranking alignment metric. The weighting mechanisms—rank-based decay and objective weighting—preserve this property while allowing flexibility in prioritizing objectives and rank positions.
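For concreteness, the (tie-free) SOVA@k computation above can be sketched as follows. This is a minimal NumPy sketch under the paper's stated assumptions (objectives normalized to $[0,1]$, weights summing to 1); the function name is ours, not from the authors' code.

```python
import numpy as np

def sova_at_k(X, Y, w, tau):
    """Set-Based Order Value Alignment at k (no ties).

    X, Y : (k, m) arrays of normalized objective values in [0, 1];
           row i is the item at rank i of the true / predicted front.
    w    : (k,) rank weights summing to 1.
    tau  : (m,) objective weights summing to 1.
    """
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    diffs = np.abs(X - Y)       # |x_{i,j} - y_{i,j}|
    per_rank = diffs @ tau      # sum_j tau'_j * |x_{i,j} - y_{i,j}| for each rank i
    return float(w @ per_rank)  # sum_i w_i * (...)
```

With identical fronts the result is 0; with maximally opposed fronts (all ones vs. all zeros) it is 1, matching the two attainability arguments above.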
B.1 Expanded definition of SOVA@K with potential ties in ranks
The core idea is that if several points in $Y$ (in our application, the predicted Pareto front) share rank $i$, we treat them as a single "group" for that rank and average their objective-wise differences against the corresponding $x_i$ in $X$ (in our case, the true Pareto front).
Definition B.3 (SOVA@k with Ties in Ranks).
Let $X=(x_1,\dots,x_k)$ and $Y=(y_1,\dots,y_k)$ be two ranked Pareto frontiers of size $k$, where $x_i\in I$ is the item at rank $i$ in $X$ (the true Pareto front) and $y_i\subseteq I$ is the set of items assigned to rank $i$ in $Y$ (the predicted Pareto front). We denote by $\Gamma_i$ the set of items ranked at position $i$. All objective values $x_{i,j}$ and $y_{p,j}$ (for $x_i\in X$, $y_p\in\Gamma_i$) are normalized to $[0,1]$.
Let $w_i$ (position weights) and $\tau_j^{\prime}$ (objective weights) satisfy $\sum_{i=1}^{k} w_i=1$ and $\sum_{j=1}^{m}\tau_j^{\prime}=1$. The Set-Based Order Value Alignment at $k$ is defined as:
$$\mathrm{SOVA}(X,Y)@k=\sum_{i=1}^{k} w_i \cdot \sum_{j=1}^{m} \tau_j^{\prime} \cdot \left(\frac{1}{|\Gamma_i|}\sum_{p\in\Gamma_i}|x_{i,j}-y_{p,j}|\right),$$
where $x_{i,j}$ and $y_{p,j}$ denote the $j$-th objective value of the $i$-th item in $X$ and of the $p$-th item in $\Gamma_i$, respectively.
Boundedness of SOVA@k with Ties
Under the assumptions of normalized objectives and normalized weights, $\mathrm{SOVA}(X,Y)@k\in[0,1]$.
Proof.
Non-negativity ($\mathrm{SOVA}(X,Y)@k\geq 0$): Since $|x_{i,j}-y_{p,j}|\geq 0$, $w_i>0$, and $\tau_j^{\prime}\geq 0$, every term in the summation is non-negative. Thus, $\mathrm{SOVA}(X,Y)@k\geq 0$.
Upper bound ($\mathrm{SOVA}(X,Y)@k\leq 1$):
1. Per-objective difference bound: Since objectives are normalized, $|x_{i,j}-y_{p,j}|\leq 1$ for all $i,j,p$.
2. Averaging within a rank: For rank $i$, the group $\Gamma_i$ may contain multiple points. An average of numbers in $[0,1]$ cannot exceed 1. Thus:
$$\frac{1}{|\Gamma_i|}\sum_{p\in\Gamma_i}|x_{i,j}-y_{p,j}|\in[0,1].$$
3. Weighted summation over objectives: Since $\sum_{j=1}^{m}\tau_j^{\prime}=1$, we have:
$$\sum_{j=1}^{m}\tau_j^{\prime}\cdot\left(\frac{1}{|\Gamma_i|}\sum_{p\in\Gamma_i}|x_{i,j}-y_{p,j}|\right)\leq\sum_{j=1}^{m}\tau_j^{\prime}\cdot 1=1.$$
4. Weighted summation over ranks: Since $\sum_{i=1}^{k} w_i=1$, the final metric satisfies:
$$\mathrm{SOVA}(X,Y)@k\leq\sum_{i=1}^{k} w_i\cdot 1=1.$$
Attainability of bounds:
- Lower bound (0): Achieved when $X=Y$ (i.e., $\Gamma_i=\{x_i\}$ for all $i$), making $|x_{i,j}-y_{p,j}|=0$.
- Upper bound (1): Achieved if $x_{i,j}=1$ and $y_{p,j}=0$ (or vice versa) for all $i,j,p$, under valid Pareto dominance.
Thus, $0\leq\mathrm{SOVA}(X,Y)@k\leq 1$. ∎
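The tie-aware definition can be sketched directly from the formula: groups of predicted points sharing a rank are averaged before the rank and objective weights are applied. This is a minimal NumPy sketch; the function name and the group-list input format are ours, not from the paper's code.

```python
import numpy as np

def sova_at_k_ties(X, groups, w, tau):
    """SOVA@k with ties.

    X      : (k, m) array, row i = true Pareto point at rank i.
    groups : list of length k; groups[i] is a (|Gamma_i|, m) array of
             predicted points sharing rank i.
    w, tau : rank / objective weights, each summing to 1.
    """
    total = 0.0
    for i, (x_i, gamma_i) in enumerate(zip(np.asarray(X, float), groups)):
        gamma_i = np.asarray(gamma_i, float)           # (|Gamma_i|, m)
        avg_diff = np.abs(gamma_i - x_i).mean(axis=0)  # per-objective mean over the group
        total += w[i] * float(avg_diff @ tau)
    return total
```

When every group is a singleton, this reduces exactly to the tie-free SOVA@k of Definition B.3's base case.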
Key Features of the Metric
- Rank-Based Weighting: Each ranking position is assigned a weight that decays exponentially as the rank increases, prioritizing the accuracy of higher-ranked points. As a consequence, errors at higher-ranked positions are treated as more significant.
- Objective Weighting: Each objective is assigned a relative weight, allowing users to prioritize specific objectives. The user-provided weights are normalized directly to sum to 1.
- Distance Aggregation: The absolute differences between corresponding ranking positions in the two sets are computed and weighted by their rank and objective importance. The total weighted distance is then aggregated across all ranks.
- [0,1] Bound: The metric is bounded in the range $[0,1]$: a value of 0 indicates perfect alignment between the rankings of the two sets, while a value of 1 represents the maximum possible disagreement.
Appendix C Experimental setting
C.1 Hardware Specification
We used three different GPUs across our experiments:
- NVIDIA A100-SXM with 80 GB of VRAM, paired with an AMD EPYC 7742 64-core CPU.
- NVIDIA GeForce RTX 4090 with 24 GB of VRAM, paired with an AMD Ryzen 9 7900 12-core CPU.
- NVIDIA L40S with 45 GB of VRAM, paired with an AMD EPYC 7R13 CPU.
C.2 Standard metrics to assess quality of Pareto solutions
Hausdorff distance
The Hausdorff distance quantifies the maximum deviation between the predicted and true Pareto fronts:
$$d_H(\mathcal{P}_{\text{pred}},\mathcal{P}_{\text{true}})=\max\left\{\sup_{p\in\mathcal{P}_{\text{pred}}}\inf_{q\in\mathcal{P}_{\text{true}}}\|p-q\|,\ \sup_{q\in\mathcal{P}_{\text{true}}}\inf_{p\in\mathcal{P}_{\text{pred}}}\|p-q\|\right\},$$
where $\|p-q\|$ is the Euclidean distance between solutions $p$ and $q$ in the objective space. Since in our scenario the objectives are bounded in $[0,1]$, the Hausdorff distance lies in $[0,\sqrt{2}]$. Smaller values of $d_H$ indicate better alignment between the predicted and true fronts.
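For finite point sets, the suprema and infima reduce to maxima and minima over pairwise distances. A minimal NumPy sketch (function name ours):

```python
import numpy as np

def hausdorff(P_pred, P_true):
    """Symmetric Hausdorff distance between two finite point sets
    in objective space, using Euclidean distance."""
    P_pred = np.asarray(P_pred, float)
    P_true = np.asarray(P_true, float)
    # pairwise distance matrix, shape (|pred|, |true|)
    D = np.linalg.norm(P_pred[:, None, :] - P_true[None, :, :], axis=-1)
    # max over directed nearest-neighbor distances, both directions
    return float(max(D.min(axis=1).max(), D.min(axis=0).max()))
```

For two points at opposite corners of the unit square the distance is $\sqrt{2}$, the upper end of the range stated above.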
Hypervolume
The Hypervolume (HV) quantifies the volume of the region in the objective space that is weakly dominated by a set of solutions and bounded with respect to a given reference point. Given a solution set $\mathcal{P}\subset\mathbb{R}^m$ and a reference point $r\in\mathbb{R}^m$, the HV is defined as:
$$HV(\mathcal{P})=\lambda\left(\bigcup_{p\in\mathcal{P}}[p,r]\right),$$
where $[p,r]$ denotes the hyperrectangle spanned by point $p$ and the reference point $r$, and $\lambda$ is the Lebesgue measure in $\mathbb{R}^m$, representing the volume. A larger HV value indicates that a greater portion of the objective space is dominated by $\mathcal{P}$, implying a better approximation to the true front. When comparing two fronts, we compute the Hypervolume difference:
$$\Delta HV=HV(\mathcal{P}_{\text{true}})-HV(\mathcal{P}_{\text{pred}}),$$
which captures the volume of the objective space that is dominated by the true front but not by the predicted one. Since the objectives are normalized in $[0,1]^m$, the HV values are bounded within $[0,1]$ for $m=2$, and smaller values of $\Delta HV$ indicate better alignment between the predicted and true Pareto fronts.
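For $m=2$ the union of hyperrectangles can be computed with a simple sweep. The sketch below assumes both objectives are expressed as minimization on $[0,1]$ with reference point $(1,1)$; the paper does not specify its HV implementation, so this is an illustrative stand-in, not the authors' code.

```python
import numpy as np

def hypervolume_2d(points, ref=(1.0, 1.0)):
    """HV for m=2 minimization objectives in [0,1]^2 w.r.t. reference point ref.

    Sorts points by the first objective and accumulates the area of the
    staircase of non-dominated rectangles up to the reference point."""
    pts = sorted(set(map(tuple, np.asarray(points, float).tolist())))
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:  # point improves the second objective: add its strip
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv
```

A single point at the origin dominates the whole unit square, giving HV = 1, consistent with the $[0,1]$ bound stated for $m=2$.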
NDCG
This metric evaluates the alignment of solution rankings in the predicted and true Pareto fronts, incorporating weights $(\omega_A,\omega_E)$ to reflect the relative importance of validation performance and energy consumption, respectively, where $\omega_A+\omega_E=1$. The NDCG at rank $k$ is computed as:
$$\text{NDCG@}k=\frac{\sum_{i=1}^{k}\frac{2^{\text{rel}_i}-1}{\log_2(i+1)}}{\sum_{i=1}^{k}\frac{2^{\text{rel}_i^{\text{ideal}}}-1}{\log_2(i+1)}},$$
where $\text{rel}_i$ and $\text{rel}_i^{\text{ideal}}$ denote the relevance scores of the predicted and ideal rankings, respectively. By varying the weights $(\omega_A,\omega_E)$, we analyze ranking consistency under different prioritization preferences.
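Given precomputed relevance scores (in this setting, presumably a weighted combination of validation performance and energy driven by $(\omega_A,\omega_E)$; the exact scoring is the paper's), the formula above can be sketched as:

```python
import numpy as np

def ndcg_at_k(rel_pred, k):
    """NDCG@k: relevance scores listed in predicted rank order;
    the ideal ranking sorts the same scores in descending order."""
    rel = np.asarray(rel_pred, float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))  # 1/log2(i+1), i = 1..k
    dcg = ((2.0 ** rel - 1) * discounts).sum()
    ideal = np.sort(np.asarray(rel_pred, float))[::-1][:k]
    idcg = ((2.0 ** ideal - 1) * discounts[: len(ideal)]).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0
```

A predicted ranking that already orders relevances in descending order scores exactly 1.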
Recall
Recall measures the proportion of true Pareto-optimal solutions that are successfully identified in the predicted front. Let $\mathcal{P}_{\text{true}}$ denote the set of true Pareto-optimal solutions and $\mathcal{P}_{\text{pred}}$ the predicted ones. Then recall is computed as:
$$\text{Recall}=\frac{|\mathcal{P}_{\text{pred}}\cap\mathcal{P}_{\text{true}}|}{|\mathcal{P}_{\text{true}}|}.$$
Recall values lie in the range $[0,1]$, where higher values indicate that a larger fraction of the true Pareto front has been correctly predicted.
F1-score
The F1-score is the harmonic mean of precision and recall, providing a single measure that balances both aspects. It is particularly useful when one seeks a trade-off between including many relevant solutions (recall) and minimizing false positives (precision). Given precision P 𝑃 P italic_P and recall R 𝑅 R italic_R, the F1-score is defined as:
$$\text{F1}=2\cdot\frac{P\cdot R}{P+R}.$$
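Both set-based metrics follow directly from the definitions above; a minimal sketch (function name ours):

```python
def recall_f1(pred, true):
    """Recall and F1 of a predicted vs. true set of Pareto-optimal solutions."""
    pred, true = set(pred), set(true)
    tp = len(pred & true)  # true positives: correctly recovered solutions
    recall = tp / len(true) if true else 0.0
    precision = tp / len(pred) if pred else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, f1
```
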
Appendix D Competitors Details
ECNAS
We use NasBench-101, which restricts the architecture search space to $3\times 3$ convolutions, $1\times 1$ convolutions, and $3\times 3$ max pooling. The original paper reports 10 trials with a population size of 10 over 100 evolutions. We adopt the SEMOA algorithm, identified as the best-performing method in their work. Their code produces a Pareto frontier of DAG-based architectures for each trial. From this, we select two architectures: one that maximises accuracy and another that balances accuracy with 50% energy consumption. These DAGs are then converted into architectures according to the specifications in the paper and the NasBench-101 GitHub repository (https://github.com/google-research/nasbench). Since several architectures achieved optimal accuracy and balanced emissions, we randomly selected one from each of the two categories considered and tested it on CIFAR-10, following the specifications of the original paper, with the epoch budget obtained from the search.
CENAS
CENAS employs reinforcement learning to optimize NAS algorithms based on GPU availability, but it is restricted to a narrow set of layer types: zeroization, skip-connection, $1\times 1$ convolution, $3\times 3$ convolution, and $3\times 3$ average pooling. While CENAS is expected to return a Pareto frontier similar to EC-NAS, the available code does not output the architectures, making it difficult to analyze or reproduce results. Attempts to contact the authors for clarification went unanswered, leaving key implementation details uncertain.
KNAS
This approach prioritizes efficiency and sustainability during the architecture search process but does not account for emissions from the final trained model. It uses the same layer types as CENAS. Additionally, unlike other NAS methods, it does not produce a Pareto frontier, making it less transparent in terms of trade-offs between accuracy, efficiency, and resource consumption. Despite these limitations, we selected a model based on its reported performance and trained it using the original paper's specifications to ensure a fair comparison. As with EC-NAS, we tested the selected architecture on CIFAR-10.
Appendix E Technical Addendum
E.1 Pareto Front Extraction and Filtering
After obtaining predictions for the two objectives at each epoch within the epoch space for each model configuration, we extract the Pareto fronts based on the ground-truth and predicted objective metrics. We identify the Pareto-optimal points by checking for non-domination: a point is Pareto-optimal if no other point is at least as good in all objectives and strictly better in at least one objective. This process ensures that the selected points form the Pareto front, representing the best trade-offs among the objectives. Once the Pareto fronts are identified, we apply a filtering step based on a user-defined threshold for the validation accuracy objective. This filter removes any solutions not meeting the required accuracy, focusing the analysis on the solutions most relevant for the task at hand. Figures 3, 4, and 5 show the obtained Pareto curves on the three test datasets.
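The non-domination check and accuracy filter can be sketched as follows. This is an illustrative $\mathcal{O}(m\cdot n^2)$ pairwise implementation, not the authors' code; the assumption that the first objective is validation accuracy (maximized) and the second is energy (minimized) follows the paper's setting.

```python
import numpy as np

def pareto_front(points, maximize=(True, False), min_acc=None):
    """Return indices of Pareto-optimal points via pairwise dominance checks.

    points   : list of (accuracy, energy) tuples.
    maximize : per-objective direction flags (accuracy up, energy down).
    min_acc  : optional validation-accuracy threshold applied to the front.
    """
    pts = np.asarray(points, float).copy()
    # flip minimized objectives so "larger is better" holds everywhere
    for j, mx in enumerate(maximize):
        if not mx:
            pts[:, j] = -pts[:, j]
    keep = []
    for i, p in enumerate(pts):
        # dominated: some other point is >= in all objectives, > in at least one
        dominated = any(np.all(q >= p) and np.any(q > p)
                        for h, q in enumerate(pts) if h != i)
        if not dominated:
            keep.append(i)
    if min_acc is not None:  # user-defined accuracy filter on the front
        keep = [i for i in keep if points[i][0] >= min_acc]
    return keep
```
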
Figure 3: Comparison of True and Predicted Pareto Fronts on CIFAR-10. Pareto-optimal configurations are shown in the normalized Validation Accuracy vs. Energy space. Each subfigure corresponds to a different percentage of discarded test data (0%, 30%, 70%), while the predictor is trained with the same seed (42) in all cases. Gray dots represent all configurations evaluated with true target values. Blue markers show the True Pareto front, orange markers the Predicted Pareto front based on the predicted value of the objectives and green markers denote Predicted Pareto configurations re-evaluated with true values. Both true and predicted fronts include only configurations achieving at least 0.9 validation accuracy, filtered based on their respective values. The x-axis (Energy) is limited to the normalized range [0.00, 0.20], and the y-axis (Validation Accuracy) spans [0.6, 1.0] to enhance clarity.
Figure 4: Comparison of True and Predicted Pareto Fronts on Rotten_tomatoes. Pareto-optimal configurations are shown in the normalized Validation Accuracy vs. Energy space. Each subfigure corresponds to a different percentage of discarded test data (0%, 30%, 70%), while the predictor is trained with the same seed (42) in all cases. Gray dots represent all configurations evaluated with true target values. Blue markers show the True Pareto front, orange markers the Predicted Pareto front based on the predicted value of the objectives and green markers denote Predicted Pareto configurations re-evaluated with true values. Both true and predicted fronts include only configurations achieving at least 0.45 validation accuracy, filtered based on their respective values. The x-axis (Energy) is limited to the normalized range [0.00, 0.20], and the y-axis (Validation Accuracy) spans [0.4, 1.0] to enhance clarity.
Figure 5: Comparison of True and Predicted Pareto Fronts on FS-TKY. Pareto-optimal configurations are shown in the normalized Validation Accuracy vs. Energy space. Each subfigure corresponds to a different percentage of discarded test data (0%, 30%, 70%), while the predictor is trained with the same seed (42) in all cases. Gray dots represent all configurations evaluated with true target values. Blue markers show the True Pareto front, orange markers the Predicted Pareto front based on the predicted value of the objectives and green markers denote Predicted Pareto configurations re-evaluated with true values. Both true and predicted fronts include only configurations achieving at least 0.9 validation accuracy, filtered based on their respective values. The x-axis (Energy) is limited to the normalized range [0.00, 0.20], and the y-axis (Validation Accuracy) spans [0.80, 1.0] to enhance clarity.
E.2 Running Time and Time Complexity Analysis
While traditional NAS algorithms require a complete re-run of the search process for each new dataset, GREEN adopts a different approach. By representing both datasets and models through features GREEN can operate directly at inference time, eliminating the need for repeated searches.
The time complexity of GREEN is primarily driven by its two main components: the target predictor and the ranker for Pareto solutions. For the first component, the complexity of the standard transformer attention mechanism per layer is $\mathcal{O}(L^{2}\cdot H\cdot\frac{E}{H})=\mathcal{O}(L^{2}\cdot E)$, where $H$ is the number of attention heads, $L$ is the sequence length, and $\frac{E}{H}$ is the size of each head. For $A$ attention blocks and batch size $B$, the total complexity becomes $\mathcal{O}(B\cdot L^{2}\cdot E\cdot A)$; the quadratic term $L^{2}$ dominates the computation for long sequences. The time complexity for computing the Pareto front and ranking solutions is determined primarily by the full Pareto front selection process, which takes $\mathcal{O}(m\cdot n^{2})$, where $n$ is the number of points in the dataset and $m$ is the number of objectives (here, $m=2$). This step checks the dominance of each point against every other point, hence the quadratic complexity. After the Pareto front is computed, we apply a filtering step based on the minimum accuracy threshold for a specific objective, which takes $\mathcal{O}(n)$.
This filtering step reduces the number of points considered in the subsequent ranking process. Following filtering, ranking involves normalizing the objectives in $\mathcal{O}(m\cdot n)$ and weighted scoring with sorting in $\mathcal{O}(m\cdot n+n\log n)$, for an overall ranking complexity of $\mathcal{O}(m\cdot n+n\log n)$. Thus, the dominant factor in the overall complexity is the $\mathcal{O}(m\cdot n^{2})$ Pareto front selection. For our specific application, where the number of points is limited, this approach is well suited; for larger datasets, algorithms with lower computational complexity can be applied.
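The dominance check and the normalize-score-sort ranking described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the weight vector, and the convention that all objectives are transformed to minimization (e.g., error and energy) are assumptions.

```python
import numpy as np

def pareto_front(points: np.ndarray) -> list[int]:
    # Naive O(m * n^2) Pareto-front extraction: a point is kept
    # if no other point is at least as good on every objective
    # and strictly better on at least one (minimization assumed).
    front = []
    for i in range(len(points)):
        dominated = any(
            j != i
            and np.all(points[j] <= points[i])
            and np.any(points[j] < points[i])
            for j in range(len(points))
        )
        if not dominated:
            front.append(i)
    return front

def rank_front(points: np.ndarray, front: list[int],
               weights: np.ndarray) -> list[int]:
    # O(m * n) min-max normalization of each objective, then
    # weighted scoring and an O(n log n) sort (best score first).
    pts = points[front]
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    norm = (pts - lo) / np.where(hi > lo, hi - lo, 1.0)
    order = np.argsort(norm @ weights)
    return [front[i] for i in order]
```

With $m=2$ objectives and a limited number of candidate points, the quadratic dominance check is negligible; the accuracy-threshold filter from the text would simply drop front indices below the threshold before calling the ranking step.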
E.3 Predictor Sanity Check
To assess whether the predictive function $q_{\theta}$ has learned meaningful properties beyond memorization, we performed a sanity check based on input perturbation. Specifically, we compared the original predictions reported in the paper with those obtained by duplicating each training example and halving the number of training epochs, keeping all other conditions fixed. This modification preserves the overall number of gradient updates while altering the training dynamics. The resulting differences in prediction accuracy, measured in terms of MAE for both objectives, are reported in Table 9. The small differences observed between the two configurations testify to the robustness of $q_{\theta}$, suggesting that the model captures generalizable patterns rather than overfitting to specific training dynamics.
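The comparison reduces to the gap between the MAEs of the two prediction sets against the same ground truth. A minimal sketch (hypothetical helper; predictions assumed to be numeric arrays):

```python
import numpy as np

def delta_mae(y_true, y_pred_a, y_pred_b) -> float:
    # Absolute difference between the MAEs of two prediction
    # sets evaluated against the same ground truth.
    y_true = np.asarray(y_true)
    mae_a = np.mean(np.abs(y_true - np.asarray(y_pred_a)))
    mae_b = np.mean(np.abs(y_true - np.asarray(y_pred_b)))
    return abs(mae_a - mae_b)
```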
| Dataset | Discard percentage (%) | $\Delta\text{MAE}_A$ | $\Delta\text{MAE}_E$ |
|---|---|---|---|
| CIFAR-10 | 0 | 0.00474 | 0.00273 |
| CIFAR-10 | 30 | 0.00629 | 0.00426 |
| CIFAR-10 | 70 | 0.01370 | 0.00760 |
| Rotten_tomatoes | 0 | 0.00519 | 0.00472 |
| Rotten_tomatoes | 30 | 0.00005 | 0.00267 |
| Rotten_tomatoes | 70 | 0.01110 | 0.00170 |
| FS-TKY | 0 | 0.00301 | 0.00274 |
| FS-TKY | 30 | 0.00251 | 0.00088 |
| FS-TKY | 70 | 0.00229 | 0.00087 |
Table 9: Difference in MAE metrics across test datasets, comparing the predictions presented in the paper with those obtained by duplicating training examples and halving the number of training epochs (seed = 476).
Appendix F Additional Results
In this section, we present additional results from our experiments. Table 10 shows the performance of GREEN on each of the models for the 3 different datasets. The performance remains consistent with the results obtained in Section 6, showing that the proposed approach achieves good performance on each of the models selected from EcoTaskSet.
Table 11 reports a similar analysis while varying the learning rate, showing that with lower learning rate values GREEN predicts the performance of the selected network more accurately. This effect is more evident for accuracy, while the predicted energy is always close to the ground truth.
| Dataset | Model | $\text{MAE}_A$ ($\downarrow$) | $\text{MAE}_E$ ($\downarrow$) |
|---|---|---|---|
| CIFAR-10 | VIT | 0.181 ± 0.000 | 0.034 ± 0.004 |
| CIFAR-10 | AlexNet | 0.203 ± 0.008 | 0.004 ± 0.000 |
| CIFAR-10 | SqueezeNet | 0.099 ± 0.021 | 0.006 ± 0.002 |
| CIFAR-10 | ResNet18 | 0.079 ± 0.010 | 0.007 ± 0.000 |
| CIFAR-10 | EfficientNet | 0.041 ± 0.005 | 0.010 ± 0.000 |
| CIFAR-10 | VGG16 | 0.228 ± 0.013 | 0.013 ± 0.003 |
| Rotten_tomatoes | RoBERTa | 0.063 ± 0.035 | 0.007 ± 0.002 |
| Rotten_tomatoes | BERT | 0.045 ± 0.014 | 0.011 ± 0.002 |
| Rotten_tomatoes | Mistral-7B-v0.3 | 0.314 ± 0.023 | 0.053 ± 0.006 |
| Rotten_tomatoes | Microsoft-PHI-2 | 0.131 ± 0.016 | 0.027 ± 0.003 |
| FS-TKY | SASRec | 0.026 ± 0.002 | 0.030 ± 0.001 |
| FS-TKY | GRU4Rec | 0.022 ± 0.001 | 0.035 ± 0.001 |
| FS-TKY | BERT4Rec | 0.026 ± 0.002 | 0.029 ± 0.001 |
| FS-TKY | CORE | 0.020 ± 0.003 | 0.030 ± 0.001 |
Table 10: MAE of the predicted performance (A for accuracy, E for energy) with respect to the ground truth obtained from EcoTaskSet. This table shows the performance of GREEN on each of the models for the 3 different datasets. Overall, for each task, the performance remains consistent across the models. Some outliers (i.e., Mistral-7B-v0.3) could depend on the reduced number of epochs that we selected to train the models.
| Dataset | Learning Rate | $\text{MAE}_A$ ($\downarrow$) | $\text{MAE}_E$ ($\downarrow$) |
|---|---|---|---|
| CIFAR-10 | $10^{-3}$ | 0.043 ± 0.012 | 0.012 ± 0.001 |
| CIFAR-10 | $10^{-2}$ | 0.225 ± 0.022 | 0.011 ± 0.002 |
| FS-TKY | $10^{-3}$ | 0.013 ± 0.002 | 0.031 ± 0.001 |
| FS-TKY | $10^{-2}$ | 0.034 ± 0.002 | 0.031 ± 0.001 |
| Rotten_tomatoes | $10^{-3}$ | 0.115 ± 0.021 | 0.022 ± 0.002 |
| Rotten_tomatoes | $10^{-2}$ | 0.145 ± 0.021 | 0.024 ± 0.002 |
Table 11: Mean Absolute Error of the predicted performance (A for accuracy, E for energy) with respect to the ground truth obtained from EcoTaskSet. This table shows the performance of GREEN for two different learning rate values selected in our test set, making evident that with lower learning rate values GREEN predicts the performance of the selected network more accurately.
| Feature | Category |
|---|---|
| Architectural component | Hyperparameters |
| Batch Size | Hyperparameters |
| Learning rate | Hyperparameters |
| Number of classes | Task features |
| Class distribution | Task features |
| GPU L2 cache size | Infrastructural features |
| GPU major revision number | Infrastructural features |
| GPU minor revision number | Infrastructural features |
| GPU total memory | Infrastructural features |
| GPU multi processor count | Infrastructural features |
| CPU bits | Infrastructural features |
| CPU number of cores | Infrastructural features |
| CPU Hz advertised | Infrastructural features |
| FLOPS | Model features |
| Number of parameters | Model features |
| Total number of layers | Model features |
| Type of activation functions | Model features |
| Number of Convolutional layers | Model features |
| Dimension of Output Channels of Convolutional Layers | Model features |
| Kernel Size of Convolutional Layers | Model features |
| Stride of Convolutional Layers | Model features |
| Number of Fully Connected Layers | Model features |
| Input Features of Fully Connected Layers | Model features |
| Number of Attention layers | Model features |
| Type of Attention Layers | Model features |
| Input Features of Attention Layers | Model features |
| Number of Heads in Attention Layers | Model features |
| LoRA rank in Attention Layers | Model features |
| Number of Embedding Layers | Model features |
| Embedding Dimension of Embedding Layers | Model features |
| Number of Batch Normalization Layers | Model features |
| Numerical Stability $\epsilon$ of Batch Normalization Layers | Model features |
| Momentum of Batch Normalization Layers | Model features |
| Number of Layer Normalization Layers | Model features |
| Numerical Stability $\epsilon$ of Layer Normalization Layers | Model features |
| Number of Dropout Layers | Model features |
| Dropout Probability of Dropout Layers | Model features |
| Discard percentage | Data features |
| Number of training examples | Data features |
| Number of validation examples | Data features |
| Image shape | Data features |
| Mean Pixel Values | Data features |
| Pixel Standard Deviation | Data features |
| Number of users | Data features |
| Number of items | Data features |
| Number of interactions | Data features |
| Interaction Density | Data features |
| Average Interaction Length | Data features |
| Median Interaction length | Data features |
| Mean sequence length | Data features |
| Maximum sequence length | Data features |
| Mean Flesch–Kincaid Grade level | Data features |
| Mean Dale-Chall Readability score | Data features |
Table 12: List of all the features used to build EcoTaskSet. Some features are task-specific and are used only for the recommendation, computer vision, or natural language processing tasks, respectively; the remaining features are shared across tasks.