
Title: On the Efficacy of Differentially Private Few-shot Image Classification

URL Source: https://arxiv.org/html/2302.01190


License: CC BY 4.0 arXiv:2302.01190v3 [stat.ML] 19 Dec 2023


Marlon Tobaben (marlon.tobaben@helsinki.fi), University of Helsinki
Aliaksandra Shysheya1 (as2975@cam.ac.uk), University of Cambridge
John Bronskill (jfb54@cam.ac.uk), University of Cambridge
Andrew Paverd (andrew.paverd@microsoft.com), Microsoft
Shruti Tople (shruti.tople@microsoft.com), Microsoft
Santiago Zanella-Béguelin (santiago@microsoft.com), Microsoft
Richard E. Turner (ret26@cam.ac.uk), University of Cambridge
Antti Honkela (antti.honkela@helsinki.fi), University of Helsinki

1 These authors contributed equally

Abstract

There has been significant recent progress in training differentially private (DP) models which achieve accuracy that approaches the best non-private models. These DP models are typically pretrained on large public datasets and then fine-tuned on private downstream datasets that are relatively large and similar in distribution to the pretraining data. However, in many applications including personalization and federated learning, it is crucial to perform well (i) in the few-shot setting, as obtaining large amounts of labeled data may be problematic; and (ii) on datasets from a wide variety of domains for use in various specialist settings. To understand under which conditions few-shot DP can be effective, we perform an exhaustive set of experiments that reveals how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, downstream dataset, and subset of learnable parameters in the model vary. We show that to achieve DP accuracy on par with non-private models, the shots per class must be increased as the privacy level increases. We also show that learning parameter-efficient FiLM adapters under DP is competitive with learning just the final classifier layer or learning all of the network parameters. Finally, we evaluate DP federated learning systems and establish state-of-the-art performance on the challenging FLAIR benchmark.

1Introduction

It is well known that neural networks trained without formal privacy guarantees can be attacked to expose a subset of the training data (Carlini et al., 2021; Balle et al., 2022). For applications where training data are sensitive (Abowd, 2018; Cormode et al., 2018), it has become increasingly common to train under Differential Privacy (DP) (Dwork et al., 2006) which is considered to be the gold standard for protecting the privacy of individual training examples. Training with DP stochastic gradient descent (DP-SGD) (Rajkumar & Agarwal, 2012; Song et al., 2013; Abadi et al., 2016), which adapts SGD to guarantee DP, typically impairs model performance due to gradient clipping and the addition of noise during training in order to mask the contribution of individual examples to model updates. However, there has been significant recent progress in training DP models which achieve accuracy that approaches the best non-private models in both NLP (Li et al., 2022b; Yu et al., 2022) and computer vision (Kurakin et al., 2022; De et al., 2022; Mehta et al., 2022; Cattan et al., 2022).

The majority of these approaches are based on transfer learning where the models have been pretrained on large public datasets and then fine-tuned (Yosinski et al., 2014) on a private downstream dataset with DP-SGD, as transfer learning has been shown to be highly effective on non-private data (Kolesnikov et al., 2020; Shysheya et al., 2022). In the non-private setting, the subset of model parameters to fine-tune ranges from all model parameters (Kolesnikov et al., 2020) to only the final layer, with the tuning of parameter-efficient adapters (Perez et al., 2018; Houlsby et al., 2019; Mahabadi et al., 2021) becoming increasingly prevalent. Transfer learning has also proven successful in the DP setting with (Yu et al., 2022) and without (Mehta et al., 2022) adapters.

However, strong DP results have only been demonstrated with relatively large datasets, with no extensive DP few-shot studies performed. The few-shot setting is crucial to any application where obtaining large amounts of labeled data is problematic. It is particularly significant in federated learning, where a global model is trained using data from multiple distributed users, and personalized federated learning, which involves customizing a federated learning model with a specific user’s data. In such scenarios, each user’s data may be sensitive and of limited size, such as medical images (Sheller et al., 2020), personal photos (Massiceti et al., 2021), or confidential personal data or actions entered on a mobile device (Differential Privacy Team, 2017; Ding et al., 2017).

In addition, the strong DP transfer learning results that have recently been reported have largely considered the case where the data distribution of the downstream dataset is similar to the pretraining data distribution (Tramèr et al., 2022). A more demanding test is out-of-domain transfer where more information needs to be extracted from the downstream dataset, making private learning more challenging. Support for differing data distributions is essential for frequently encountered specialist settings such as medical imaging, Earth imaging, or personalized object recognition.

In this work, we answer the question: Under what conditions is differentially private few-shot image classification effective? Our contributions are:

•

We provide the first comprehensive study on the efficacy of DP few-shot image classification. In particular, in the centralized setting we perform an exhaustive set of experiments that reveals how the accuracy of DP and non-private models is affected as the number of shots per class, privacy level, downstream dataset, model architecture, and the subset of learnable parameters in the model vary. We also investigate whether the trends observed in the centralized setting apply to federated learning. Novel insights include:

Amount of data required: It is known that classification accuracy under DP decreases as the level of privacy increases and the amount of data decreases. However, (i) we quantify how much more data is required under various levels of DP to match non-private accuracy; in particular, we found that the number of shots per class must be increased significantly to match non-private performance, depending on the subset of learnable parameters; and (ii) we show that accuracy under DP is strongly related to the difficulty of the transfer learning task.

Model parameterization: We show that fine-tuning parameter-efficient FiLM adapters in addition to the final linear classifier layer performs close to or better than fine-tuning all parameters in the model or fine-tuning only the final layer under few-shot DP. This is demonstrated by superior accuracy for the FiLM configuration on the challenging VTAB-1k benchmark and establishing state-of-the-art in terms of accuracy (macro average precision increased from 44.3 % to 51.9 % ) and communication efficiency (cost reduced from 11.9M to 0.017M parameters per round) on the large-scale FLAIR federated learning benchmark.

Characterization of few-shot DP learning dynamics: We show that non-private few-shot transfer learners are generally in the interpolating regime where they achieve 100 % training accuracy. Under strong DP, trained networks are generally in the regularization regime where test and train accuracies are comparable.

•

We assess the vulnerability of DP few-shot models with a strong membership inference attack (MIA) and find that non-private models are highly susceptible and that the privacy level must be increased to a high level to mitigate such attacks.

•

Finally, we establish recommended practice guidelines for training DP few-shot models.

2Background

In this section, we provide background information, definitions, and nomenclature required for subsequent sections. We focus our analysis on few-shot transfer learning based image classifiers that rely on large backbones pretrained on non-private data.

Preliminaries We denote input images $\boldsymbol{x}$ and image labels $y \in \{1, \ldots, C\}$, where $C$ is the number of image classes indexed by $c$. Assume that we have access to a model $f(\boldsymbol{x}) = h_{\phi}(b_{\boldsymbol{\theta}}(\boldsymbol{x}))$ that outputs class probabilities for an image, $p(y \mid \boldsymbol{x}, \boldsymbol{\theta}, \phi) = f(\boldsymbol{x}, \boldsymbol{\theta}, \phi)$, and comprises a feature extractor backbone $b_{\boldsymbol{\theta}} : \mathbb{R}^{d} \to \mathbb{R}^{d_b}$ with parameters $\boldsymbol{\theta}$ pretrained on a large upstream public dataset such as Imagenet-21K (Russakovsky et al., 2015), where $d$ is the input image dimension and $d_b$ is the output feature dimension, and a linear layer classifier or head $h_{\phi} : \mathbb{R}^{d_b} \to \mathbb{R}^{C}$ with weights $\phi$. Let $\mathcal{D} = \{(\boldsymbol{x}_n, y_n)\}_{n=1}^{N}$ be the private downstream dataset that we wish to fine-tune the model $f$ to. We denote the number of training examples per class, or shot, as $S$.

Learnable Parameters In all experiments, the head parameters $\phi$ are initialized to zero and are always learned when fine-tuning on $\mathcal{D}$. For the backbone weights $\boldsymbol{\theta}$, we consider three options: (i) Head: $\boldsymbol{\theta}$ are fixed at their pretrained values and do not change during fine-tuning, so only the head parameters $\phi$ are updated; (ii) All: $\boldsymbol{\theta}$ are initialized with pretrained values, but can be updated during fine-tuning in addition to the head; and (iii) FiLM: using FiLM (Perez et al., 2018) layers. There exists a myriad of adapters for both 2D convolutional and transformer networks, including FiLM, Adapter (Houlsby et al., 2019), LoRA (Hu et al., 2022a), VPT (Jia et al., 2022), AdaptFormer (Chen et al., 2022c), NOAH (Zhang et al., 2022), Convpass (Jie & Deng, 2022), Model Patch (Mudrakarta et al., 2019), and CaSE (Patacchiola et al., 2022), that enable a pretrained network to adapt to a downstream dataset in a parameter-efficient manner. In this work, we use FiLM due to its simplicity, high performance, and low parameter count (Shysheya et al., 2022), though another adapter could be used. A FiLM layer scales and shifts the activations $\boldsymbol{a}_{ij}$ arising from the $j^{\text{th}}$ output of a layer in the $i^{\text{th}}$ block of the backbone as $\texttt{FiLM}(\boldsymbol{a}_{ij}, \gamma_{ij}, \beta_{ij}) = \gamma_{ij}\,\boldsymbol{a}_{ij} + \beta_{ij}$, where $\gamma_{ij}$ and $\beta_{ij}$ are scalars. We implement FiLM by fixing $\boldsymbol{\theta}$ at their pretrained values except for a subset of the scale and offset parameters utilized in the backbone normalization layers (e.g. GroupNorm, LayerNorm, etc., see Section A.3.1 for details), which can update during fine-tuning. For example, in a ResNet50, there are only 11 648 learnable FiLM parameters, which is fewer than 0.05% of $\boldsymbol{\theta}$.
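The scale-and-shift operation described above can be illustrated with a minimal sketch. The function name `film`, the channel-major list layout, and the example values are assumptions made for illustration; this is not the authors' implementation, which reuses the normalization-layer parameters of the backbone.

```python
from typing import List


def film(activations: List[List[float]], gamma: List[float], beta: List[float]) -> List[List[float]]:
    """Apply one scalar scale (gamma) and shift (beta) per channel.

    `activations` is channel-major: activations[c] holds the values of channel c.
    """
    if not (len(activations) == len(gamma) == len(beta)):
        raise ValueError("one (gamma, beta) pair is required per channel")
    return [[g * a + b for a in channel] for channel, g, b in zip(activations, gamma, beta)]


# Hypothetical activations for two channels.
acts = [[1.0, -2.0], [0.5, 4.0]]

# With gamma = 1 and beta = 0 the layer is the identity.
assert film(acts, [1.0, 1.0], [0.0, 0.0]) == acts

# A non-trivial scale and shift per channel.
print(film(acts, [2.0, 0.5], [1.0, -1.0]))  # [[3.0, -3.0], [-0.75, 1.0]]
```

Because each channel contributes only two scalars, the number of learnable parameters grows with the number of channels rather than with the number of weights in the backbone.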

Transfer Difficulty (TD) The overlap between the distributions of the pretraining data and the downstream dataset, as well as other factors such as the number of classes in the downstream dataset, are key determinants of the ease and success of transfer learning. We measure the transfer difficulty (TD) as the relative difference between the accuracy of the All and Head learnable parameter configurations for a non-private model: $\mathrm{TD} = 100\,(\mathrm{Acc}_{\mathrm{All}} - \mathrm{Acc}_{\mathrm{Head}}) / \mathrm{Acc}_{\mathrm{All}}$. This simple metric captures how different the downstream dataset is from the pretraining data as well as other factors that complicate transfer learning such as the number of classes $C$ in the downstream dataset and its size $|\mathcal{D}|$. If transfer learning is easy (i.e. TD is low), then only adapting the head of the network is sufficient. If transfer learning is more difficult (i.e. TD is high), then the backbone must also be adapted. Table 1 provides the TD values for all of the datasets used in the paper.
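As a small worked example of the TD metric, the sketch below evaluates the formula on made-up accuracies. The function name `transfer_difficulty` and the numbers are hypothetical and are not taken from Table 1.

```python
def transfer_difficulty(acc_all: float, acc_head: float) -> float:
    """Relative accuracy gap (in percent) between the All and Head configurations."""
    if acc_all <= 0.0:
        raise ValueError("acc_all must be positive")
    return 100.0 * (acc_all - acc_head) / acc_all


# Hypothetical non-private accuracies (in percent), for illustration only.
print(transfer_difficulty(acc_all=90.0, acc_head=45.0))  # 50.0: the head alone recovers half of All
print(transfer_difficulty(acc_all=80.0, acc_head=80.0))  # 0.0: adapting the backbone adds nothing
```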

Differential Privacy (DP) DP (Dwork et al., 2006) is the gold standard for protecting sensitive data against privacy attacks. A stochastic algorithm is differentially private if it produces similar output distributions on similar datasets. More formally, ( 𝜖 , 𝛿 ) -DP with privacy budget 𝜖 ≥ 0 (lower means more private) and additive error 𝛿 ∈ [ 0 , 1 ] bounds how much the output distribution can diverge on adjacent datasets. We use add/remove adjacency, where two datasets are adjacent if one can be obtained from the other by adding or removing one data record, which could be a single datapoint in case of example-level privacy or data belonging to a single user in case of user-level privacy. The additive error is typically chosen such that 𝛿 < 1 / | 𝒟 | . We refer to Dwork & Roth (2014) for a thorough introduction to DP.
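For reference, the standard formal statement behind this description (see Dwork & Roth, 2014) can be written as follows. The symbols $\mathcal{M}$ (the randomized algorithm) and $E$ (a set of possible outputs) are notation introduced here and are not used elsewhere in the paper.

```latex
\Pr\left[\mathcal{M}(\mathcal{D}) \in E\right]
  \;\le\; e^{\epsilon}\, \Pr\left[\mathcal{M}(\mathcal{D}') \in E\right] + \delta
\qquad \text{for all adjacent } \mathcal{D}, \mathcal{D}' \text{ and all measurable } E .
```

Setting $\delta = 0$ recovers pure $\epsilon$-DP, and smaller $\epsilon$ forces the two output distributions closer together.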

DP-SGD (Rajkumar & Agarwal, 2012; Song et al., 2013; Abadi et al., 2016) adapts stochastic gradient descent (SGD) to guarantee DP. DP-SGD selects mini-batches using Poisson sampling, clips the $\ell_2$ norm of per-example gradients, and adds isotropic Gaussian noise to the sum of mini-batch gradients. The level of privacy ($(\epsilon, \delta)$-DP) is controlled by the noise multiplier $\sigma^2$, which scales the variance of the added noise, the number of steps, and the sampling ratio (the Poisson sampling probability, i.e., expected batch size$/|\mathcal{D}|$).
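The three ingredients named above (Poisson sampling, per-example clipping, Gaussian noise on the summed gradients) can be sketched in plain Python. This is an illustrative sketch, not the Opacus implementation used in the experiments. The helper names, the use of precomputed per-example gradients, the noise standard deviation `noise_std_multiplier * max_norm`, and the division by the expected batch size are all assumptions of this sketch.

```python
import math
import random
from typing import List, Sequence


def clip_l2(grad: Sequence[float], max_norm: float) -> List[float]:
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(
    params: Sequence[float],
    per_example_grads: Sequence[Sequence[float]],
    sample_rate: float,
    max_norm: float,
    noise_std_multiplier: float,
    lr: float,
    rng: random.Random,
) -> List[float]:
    """One DP-SGD update on precomputed per-example gradients (illustrative)."""
    n = len(per_example_grads)
    # Poisson sampling: each example joins the batch independently.
    batch = [i for i in range(n) if rng.random() < sample_rate]
    # Sum of clipped per-example gradients.
    summed = [0.0] * len(params)
    for i in batch:
        for j, g in enumerate(clip_l2(per_example_grads[i], max_norm)):
            summed[j] += g
    # Isotropic Gaussian noise with standard deviation tied to the clipping norm.
    std = noise_std_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    # Normalise by the expected batch size (an assumption of this sketch).
    expected_batch = sample_rate * n
    return [p - lr * g / expected_batch for p, g in zip(params, noisy)]


# Hypothetical gradients for four examples and a two-parameter model.
grads = [[3.0, 4.0], [0.3, 0.4], [-1.0, 0.0], [0.0, 2.0]]
print(dp_sgd_step([0.0, 0.0], grads, sample_rate=0.5, max_norm=1.0,
                  noise_std_multiplier=1.0, lr=0.1, rng=random.Random(0)))
```

Clipping bounds how much any single example can move the sum, which is what allows a fixed amount of noise to mask each example's contribution.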

Membership Inference Attacks (MIAs) MIAs aim to determine if a particular example was used in the training set of a model (Shokri et al., 2017). MIAs can be used to derive lower bounds to complement the theoretical upper bounds of ( 𝜖 , 𝛿 ) -DP for trained models. While there are many types of MIA (Hu et al., 2022b), in this work we consider attacks that operate in the black-box mode (i.e. only model outputs can be observed) and can evaluate the loss on particular training or test examples (Carlini et al., 2022; Ye et al., 2022). In addition, we assume that attacks have access to images from the training data distribution and know the training algorithm used and its hyperparameters. To evaluate the effectiveness of a MIA, we examine the Receiver Operating Characteristic (ROC) curve which plots the attack true positive rate (TPR) against its false positive rate (FPR). We focus on the TPR at low FPR regime since a MIA is harmful if it can infer membership of even a small number of training examples with high confidence (Carlini et al., 2022).
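The "TPR at low FPR" criterion can be made concrete with a short sketch that sweeps a threshold over attack scores. The function name `tpr_at_fpr`, the convention that higher scores indicate membership, and the example scores are assumptions for illustration; the sketch is independent of any particular attack.

```python
from typing import Sequence


def tpr_at_fpr(member_scores: Sequence[float],
               nonmember_scores: Sequence[float],
               target_fpr: float) -> float:
    """Largest TPR of a threshold rule (score >= t) whose FPR does not exceed target_fpr."""
    thresholds = sorted(set(member_scores) | set(nonmember_scores), reverse=True)
    best_tpr = 0.0
    for t in thresholds:
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        if fpr > target_fpr:
            break  # FPR only grows as the threshold is lowered
        best_tpr = sum(s >= t for s in member_scores) / len(member_scores)
    return best_tpr


# Hypothetical attack scores: higher means "more likely a training member".
members = [0.9, 0.8, 0.4, 0.2]
nonmembers = [0.7, 0.3, 0.1, 0.05]
print(tpr_at_fpr(members, nonmembers, target_fpr=0.0))   # 0.5
print(tpr_at_fpr(members, nonmembers, target_fpr=0.25))  # 0.75
```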

3Related Work

DP Transfer Learning Section 1 describes various works where DP transfer learning using models pretrained on large public datasets achieves accuracy close to non-private approaches. However, to the best of our knowledge, there are no comprehensive studies on few-shot transfer learning under DP. The closest work to ours is Luo et al. (2021), where the authors evaluate DP fine-tuning of a sparse subset of the parameters of models pretrained on public data on a small number of few-shot downstream datasets. Their work employs a relatively small backbone (ResNet18), pretrained on a small public dataset (miniImageNet), with limited analysis. In contrast, our work utilizes large backbones, a large public pretraining set, a wider range of privacy levels and downstream datasets, in addition to assessing vulnerability to attacks and the federated learning setting. Tramèr et al. (2022) point out that current DP benchmarks rely excessively on downstream datasets with a high level of overlap with the pretraining data. Our work addresses this issue by evaluating on datasets with a wide range of TD.

Federated Learning (FL) and Transfer Learning There has been a recent surge of interest in using large pretrained models as initialization for training decentralized models in both NLP (Lin et al., 2022; Stremmel & Singh, 2021; Weller et al., 2022; Tian et al., 2022) and computer vision (Chen et al., 2022b; Tan et al., 2022; Qu et al., 2021; Chen et al., 2022a; Nguyen et al., 2022; Liu et al., 2022). Most of these works were able to improve upon state-of-the-art results under different tasks and settings within FL as well as showing that the client data heterogeneity problem often seen in FL can be partially mitigated with pretrained networks.

FL and DP Even though the server in FL does not have access to raw user data, the privacy of users may still be compromised if (i) the server is untrusted (Huang et al., 2021) or (ii) a third party has access to the model after training (Geiping et al., 2020; Carlini et al., 2022). Cryptographic techniques like secure aggregation (Goryczka et al., 2013) can partially mitigate the former issue, while to fully tackle it as well as the latter, DP adaptations of the FL aggregation algorithms are needed (McMahan et al., 2018). Similarly to DP-SGD, DP-FedAvg (McMahan et al., 2018) is an adaptation of the baseline FL algorithm FedAvg (McMahan et al., 2017), which provides user-level DP guarantees by applying the Gaussian mechanism to parameter updates sent to the server. Recently, a few studies have investigated the use of large pretrained models for FL under DP constraints in NLP (Basu et al., 2021), representation learning (Xu et al., 2022), and image classification (Song et al., 2022). The closest work to ours is Song et al. (2022), who introduce FLAIR, a few-shot federated learning image classification dataset, which they use to perform a relatively small evaluation of pretrained models (only ResNet18 was used) fine-tuned using FL under DP. However, to the best of our knowledge, there are no other studies on how large pretrained models fine-tuned via FL aggregation algorithms behave under DP constraints for transfer-learned image classification. In this work we aim to fill this gap and evaluate these methods on real-world datasets.

4Centralized Learning Experiments

In our experiments, we endeavor to answer the question: “Under what conditions is differentially private few-shot image classification effective?” We focus on transfer learning approaches that utilize large backbones pretrained on public data. We do this empirically by varying the: (i) number of shots $S$; (ii) set of learnable parameters in $f$ (All, Head, FiLM); (iii) downstream dataset $\mathcal{D}$ (with varying TD); and (iv) network architecture: BiT-M-R50x1 (R-50) (Kolesnikov et al., 2020) with 23.5M parameters and Vision Transformer VIT-Base-16 (VIT-B) (Dosovitskiy et al., 2021) with 85.8M parameters, both pretrained on the ImageNet-21K dataset. In all experiments, we assume that the pretraining data is public and the downstream data is private. Source code for all experiments can be found at: https://github.com/cambridge-mlg/dp-few-shot.

Datasets For the experiments where $S$ is varied, we use the CIFAR-10 (low TD) and CIFAR-100 (medium TD) datasets (Krizhevsky, 2009), which are commonly used in DP transfer learning, and SVHN (Netzer et al., 2011), which has a high transfer difficulty and hence requires a greater degree of adaptation of the pretrained backbone. We also evaluate on the challenging VTAB-1k transfer learning benchmark (Zhai et al., 2019) that consists of 19 datasets grouped into three distinct categories (natural, specialized, and structured) with training set size fixed at $|\mathcal{D}| = 1000$ and widely varying TD.

Training Protocol For all centralized experiments, we first draw $\mathcal{D}$ of the required size ($|\mathcal{D}| = CS$, i.e. the number of classes $C$ multiplied by shot $S$, for varying shot, or $|\mathcal{D}| = 1000$ for VTAB-1k) from the entire training split of the current dataset under evaluation. For the purposes of hyperparameter tuning, we then split $\mathcal{D}$ into 70% train and 30% validation. We then perform 20 iterations of Bayesian optimization based hyperparameter tuning (Bergstra et al., 2011) with Optuna (Akiba et al., 2019) to derive a set of hyperparameters that yield the highest accuracy on the validation data. This set of hyperparameters is subsequently used to train a final model on all of $\mathcal{D}$. We evaluate the final, tuned model on the entire test split of the current dataset. Details on the set of hyperparameters that are tuned and their ranges can be found in Section A.3.2.
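The data-preparation part of this protocol (drawing $S$ examples per class, then a 70/30 train/validation split) can be sketched as follows. The helper names and the choice of an unstratified random split are assumptions of this sketch; the paper does not specify how the split is drawn, and the hyperparameter search itself is omitted.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def draw_few_shot(labels: Sequence[int], shots: int, rng: random.Random) -> List[int]:
    """Return indices of `shots` randomly chosen examples for every class."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    chosen: List[int] = []
    for y in sorted(by_class):
        if len(by_class[y]) < shots:
            raise ValueError(f"class {y} has fewer than {shots} examples")
        chosen.extend(rng.sample(by_class[y], shots))
    return chosen


def train_val_split(indices: Sequence[int], train_fraction: float,
                    rng: random.Random) -> Tuple[List[int], List[int]]:
    """Shuffle the indices and cut them into train and validation parts."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical training split: 1000 examples over C = 10 classes.
rng = random.Random(0)
labels = [i % 10 for i in range(1000)]
subset = draw_few_shot(labels, shots=10, rng=rng)   # |D| = C * S = 100
train_idx, val_idx = train_val_split(subset, 0.7, rng)
print(len(subset), len(train_idx), len(val_idx))    # 100 70 30
```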

For DP fine-tuning on $\mathcal{D}$, we use Opacus (Yousefpour et al., 2021) and compute the required noise multiplier depending on the targeted $(\epsilon, \delta)$. We report the results over three runs. For all experiments, we set $\delta = 1/|\mathcal{D}|$ and report $(\epsilon, \delta)$-DP computed with the RDP accountant (Mironov, 2017). Note that because we often change the dataset size $|\mathcal{D}|$ in our experiments, certain comparisons may be difficult since $\delta$ will also vary. Similarly to previous work (De et al., 2022; Mehta et al., 2022; Sander et al., 2022), we do not account for privacy loss originating from the tuning of the hyperparameters. See Section A.3 for additional training details.

4.1Few-shot DP Data Requirements

Fig. 1 depicts the performance of transfer learning under DP when varying $S$, $\epsilon$, and TD. Tabular results can be found in Tables 2, 3, 4, 5, 6 and 7. We see that accuracy decreases as $S$ and $\epsilon$ decrease and as TD increases. For $S \le 10$, accuracy is poor under DP. However, if the TD is low or medium, a moderate number of shots ($S \approx 100$) is sufficient to approach the accuracy of the non-private setting. For example, at $S = 100$, the model achieves better than 90% accuracy on CIFAR-10 using only 2% of the full training split at $\epsilon = 1$. On the other hand, if TD is high, learning is more challenging and more shots are required to approach non-private accuracy. For example, for $S = 100$ and $\epsilon = 2$, SVHN achieves just over 20% accuracy and falls well short of non-private levels even at $S = 500$. Note that $\delta$ changes based on $|\mathcal{D}|$. Tables 23 and 24 provide $(\epsilon, \delta)$-DP guarantees computed for when $\delta$ is fixed and thus independent of $|\mathcal{D}|$.

Figure 1:Classification accuracy as a function of shots and $\epsilon$ for CIFAR-10, CIFAR-100 and SVHN. Backbone is VIT-B and the best performing configuration out of All, FiLM and Head is used for each combination of $\epsilon$ and $S$, with $\delta = 1/|\mathcal{D}|$. The accuracy is reported over three seeds with the line showing the median and the band reporting the lowest and highest accuracy. Analysis: Classification accuracy decreases as $S$ and $\epsilon$ decrease and TD increases.

Fig. 2 shows the multiplier on the number of DP shots to match non-private accuracy (see Section A.2.2 for additional figures and details). On the left we average over $S \in \{5, 10\}$, datasets, and network architectures. For all configurations, $S$ must be increased by approximately $4$ to $8\times$ at $\epsilon = 8$ to meet non-private accuracy, and by $20$ to $35\times$ at $\epsilon = 1$. In effect, as the privacy level increases, the required multiplier increases in an exponential manner. The multipliers are lower for simpler forms of adaptation (e.g. Head requires $20 \times S$ at $\epsilon = 1$) than for more complex forms (e.g. All requires $35 \times S$ at $\epsilon = 1$). On the right we average over $S \in \{5, 10\}$, network architectures, and learnable parameters. Even though high TD datasets require more data for good accuracy, the multiplier values are similar and independent of the TD of the dataset (around $30\times$ at $\epsilon = 1$ and $6\times$ at $\epsilon = 8$). Fig. 3 shows the classification accuracy as a function of TD at $S = 100$. The accuracy gap between non-private and private training increases as TD increases.
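One way a shot multiplier of this kind could be computed is sketched below: given DP accuracy measured at a few shot counts, linearly interpolate to find where the curve reaches the non-private accuracy, then divide by the non-private shot count. The function names, the interpolation in $S$ rather than $\log S$, and all numbers are assumptions of this sketch; the procedure actually used is described in Section A.2.2.

```python
from typing import Optional, Sequence


def shots_to_match(dp_shots: Sequence[float], dp_acc: Sequence[float],
                   target_acc: float) -> Optional[float]:
    """Shot count where a piecewise-linear DP accuracy curve first reaches target_acc."""
    points = list(zip(dp_shots, dp_acc))
    if points[0][1] >= target_acc:
        return float(points[0][0])
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        if a0 < target_acc <= a1:
            return s0 + (target_acc - a0) * (s1 - s0) / (a1 - a0)
    return None  # target not reached within the measured range


def shot_multiplier(nonprivate_shots: float, nonprivate_acc: float,
                    dp_shots: Sequence[float], dp_acc: Sequence[float]) -> Optional[float]:
    """Factor by which the shots must grow under DP to match non-private accuracy."""
    needed = shots_to_match(dp_shots, dp_acc, nonprivate_acc)
    return None if needed is None else needed / nonprivate_shots


# Hypothetical measurements (not taken from the paper's tables).
dp_shots = [10, 100, 500]
dp_acc = [20.0, 60.0, 80.0]
print(shot_multiplier(10, 70.0, dp_shots, dp_acc))  # 30.0
```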

Figure 2:Multiplier of shots required to reach non-private accuracy. Left: Average over $S \in \{5, 10\}$, datasets, and network architectures. Right: Average over $S \in \{5, 10\}$, network architectures, and learnable parameters. The data is obtained using linear interpolation. See Section A.2.2. $\delta = 1/|\mathcal{D}|$. Analysis: $S$ must be increased by approximately $20$ to $35\times$ to meet non-private accuracy at $\epsilon = 1$ and $4$ to $8\times$ at $\epsilon = 8$.

Figure 3:Classification accuracy as a function of transfer difficulty (TD) and $\epsilon$ for CIFAR-10 (TD $= 1.0$), EuroSAT (TD $= 1.7$), CIFAR-100 (TD $= 7.8$) and SVHN (TD $= 52.9$) at $S = 100$. EuroSAT has been chosen because the result for $S = 100$ can be easily taken from the VTAB results (Tables 8, 9, 10, 11, 12 and 13) due to $C = 10$. Backbone is VIT-B and the best performing configuration out of All, FiLM and Head is used for each $\epsilon$, with $\delta = 1/|\mathcal{D}|$. The accuracy is reported over three seeds with the line showing the median and the band reporting the lowest and highest accuracy. Analysis: The accuracy gap between non-private and private training increases as TD increases.

4.2Characterization of Learning Under Few-Shot DP

In this section, we provide empirical evidence to highlight the different traits of private and non-private learning. Fig. 4 shows snapshots at $\epsilon \in \{1, 8, \infty\}$ of the train and test accuracies as a function of $S$ for CIFAR-100 (medium TD) and SVHN (high TD) (see Figs. 14 and 15 for versions with additional values of $\epsilon$). The three snapshots for each dataset can be viewed as discrete points on a continuum from low to high $\epsilon$. We see that learning under DP is fundamentally different from non-private learning. Non-private models with sufficient capacity operate in the interpolating regime and attain close to 100% training accuracy at all values of $S$, but have substantially lower test accuracy when $S$ is low. In contrast, models that are trained with DP-SGD are learning under heavy regularization and thus the training and test accuracies are significantly lower, but similar in value. When $S$ is low, test accuracy is relatively poor, and as $S$ increases, test accuracy steadily improves. The point at which accuracy begins to improve varies with TD (CIFAR-100 test accuracy improvement starts much earlier than for SVHN). Independent of $S$, for low $\epsilon$, the train-test gap is very small, with the train accuracy indicative of the test performance. As $\epsilon$ increases, the train accuracy grows as the amount of regularization pressure from DP is reduced, ultimately entering the interpolating regime. For SVHN with $\epsilon = \infty$, Head leaves the interpolating regime for $S = 100$, as there is not enough capacity to adapt to a high TD dataset. As $\epsilon$ increases and $S$ remains low, the test accuracy does not increase as quickly as the train accuracy and the accuracy gap grows. However, as $S$ increases, test accuracy starts to catch up with train accuracy, reducing the gap. Fig. 21 shows the train and test accuracies for all 19 VTAB datasets as a function of $\epsilon$, where the general trends noted in Fig. 4 are also evident.

Figure 4:Snapshots at $\epsilon \in \{1, 8, \infty\}$, $\delta = 1/|\mathcal{D}|$, of the train/test accuracies as a function of $S$ for CIFAR-100 and SVHN. The trends in accuracy gap are shown with red arrows. At low $S$, the gap grows as $\epsilon$ increases; at high $S$, the gap decreases to nearly zero; and the gap grows as $S$ decreases. Analysis: In the non-private setting ($\epsilon = \infty$), learning operates in the interpolation mode (i.e. train accuracy is 100%, yet test accuracy continues to increase as $S$ increases). As the privacy level increases, learning operates under heavy regularization and the gap between train and test accuracy reduces.

While the results of Section 4.1 indicate that both private and non-private test accuracy benefit from additional training data, it is evident that their learning behavior is significantly different.

4.3Few-shot DP Model Parameterization

Figure 5:Classification accuracy as a function of shots and learnable parameters on VIT-B for CIFAR-10, CIFAR-100 and SVHN for $\epsilon \in \{2, \infty\}$ with $\delta = 1/|\mathcal{D}|$. The accuracy is reported over three seeds with the line showing the median and the band reporting the lowest and highest accuracy. Analysis: FiLM is comparable to or better than All and Head in terms of accuracy despite fine-tuning fewer than 0.05% of the parameters in the backbone.

Fig. 5 depicts classification accuracy as a function of $S$, two different values of $\epsilon$, and learnable parameters. FiLM is comparable to or better than All and Head in terms of accuracy despite fine-tuning fewer than 0.05% of the parameters in the backbone. When the TD is low, training only Head is competitive with FiLM and All, but when TD is medium or high, Head falls short as it cannot adapt the backbone to a dataset that has a different data distribution. These observations have two implications: (i) FiLM is able to adapt to differing downstream datasets under DP and serves as a computationally efficient alternative to All; (ii) the result provides empirical support for the observations of Li et al. (2022a) that the number of parameters has little effect on the privacy-utility trade-off when fine-tuning large pretrained models. Prior theory (Chaudhuri et al., 2011; Bassily et al., 2014) suggested that All should perform worse under DP compared to configurations with fewer parameters.

Figure 6:Average classification accuracy over all VTAB-1k datasets as a function of backbone, learnable parameters, and privacy level ($\epsilon$) at $\delta = 10^{-3}$. Colored columns indicate results under DP, light gray indicates non-private accuracy for the corresponding configuration. Analysis: DP classification accuracy (colored columns) decreases significantly as $\epsilon$ is decreased and always falls short of non-private accuracy (gray columns). For non-private settings, the All learnable parameters setting outperforms FiLM, which outperforms Head. In contrast, for DP settings, All performs worst, FiLM and Head perform similarly, though FiLM is better in the majority of cases.

Fig. 6 shows average classification accuracy over all of the datasets in the VTAB-1k benchmark (tabular results are in Tables 8, 9, 10, 11, 12 and 13 and comprehensive graphical results are in Figs. 19 and 20). We see that DP classification accuracy decreases significantly as 𝜖 is decreased and always falls short of non-private accuracy. For non-private settings, the All learnable parameters setting outperforms FiLM which outperforms Head. In contrast, for DP settings, All performs worst, FiLM and Head perform similarly, though FiLM is better in the majority of cases. One explanation for this is that under DP at low 𝑆 , All requires more data compared to Head and FiLM for accuracy to progress beyond random chance as can be seen in Fig. 5.

Fig. 7 shows the difference between the accuracy of FiLM and Head for VTAB-1k datasets as a function of $\epsilon$. The datasets are ordered from low to high TD (see Table 1). At $\epsilon = 1$, Head has an advantage over FiLM on several datasets. FiLM shows a significant advantage when the TD increases and as $\epsilon$ increases. Refer to Section A.2.5 for additional heat maps.

Figure 7:Heat map showing the accuracy difference between FiLM and Head for the VTAB-1k datasets as a function of $\epsilon$. Backbone is VIT-B. Darker red indicates FiLM is better. Darker blue indicates Head is better. Datasets ordered from low to high TD. $\delta = 10^{-3}$. Analysis: At $\epsilon = 1$, Head has an advantage over FiLM on several datasets. FiLM shows a significant advantage when the TD increases and as $\epsilon$ increases.

4.4Membership Inference Attacks

We use the state-of-the-art Likelihood Ratio Attack (LiRA) (Carlini et al., 2022) to attack models trained on CIFAR-100 with varying $S$ and privacy level $\epsilon$ using 256 shadow models. Refer to Section A.3.5 for additional detail. Excerpts from attack results are shown in Fig. 8. The complete set of attack ROC curves is shown in Figs. 22 and 23, while Table 14 reports TPR at several low FPR values, AUC score, and the maximum membership inference advantage (defined as TPR $-$ FPR by Yeom et al. (2018)) achieved over the curve. Key observations are:

•

Non-private ($\epsilon = \infty$) models are extremely vulnerable to MIAs (see Fig. 8, middle). For example, in the case of $\epsilon = \infty$, $S = 10$, Head configuration, 82.2% of the examples can be successfully identified with a false positive rate of only 0.1%.

•

Vulnerability of non-private ($\epsilon = \infty$) models decreases as $S$ increases. Also, the FiLM configuration is consistently less vulnerable than Head (see Fig. 8, middle). We hypothesize that FiLM generalizes better, so training examples do not stand out as much as in the Head configuration.

•

When 𝑆 is fixed, vulnerability to MIAs greatly decreases with decreasing 𝜖 (see Fig. 8, right). Already with 𝜖

2 , when 𝑆

10 and FiLM the vulnerability to MIA is substantially reduced, 2.5 % of the examples can be successfully identified with an FPR of 1 % and 0.3 % of the examples with 0.1 % FPR (see Table 14).

•

Under DP, there appears to be little or no difference between the vulnerability of the FiLM and Head configurations at the same 𝜖 (see Fig. 8, right).
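The summary statistics reported for each attack (TPR at a fixed low FPR, AUC, and the maximum advantage TPR − FPR) can all be read off the ROC curve. The following is a minimal sketch in plain Python, assuming each candidate example has a scalar attack score where a higher value means "more likely a member". The function names are illustrative and are not taken from the paper's code.

```python
def roc_points(scores, is_member):
    """Sweep a threshold over attack scores and return (fpr, tpr) lists.

    Higher score means "more likely a member". Tied scores are consumed
    together so that each threshold yields exactly one operating point.
    """
    pairs = sorted(zip(scores, is_member), key=lambda p: -p[0])
    n_pos = sum(1 for m in is_member if m)
    n_neg = len(is_member) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
        i = j
    return fpr, tpr


def tpr_at_fpr(fpr, tpr, target_fpr):
    """Largest TPR among operating points whose FPR does not exceed target_fpr."""
    return max(t for f, t in zip(fpr, tpr) if f <= target_fpr)


def auc(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
               for k in range(1, len(fpr)))


def attack_advantage(fpr, tpr):
    """Maximum of TPR - FPR over the curve (Yeom et al., 2018)."""
    return max(t - f for f, t in zip(fpr, tpr))
```

In practice the scores would come from LiRA applied to every target model, and the low-FPR region (for example FPR = 0.001) is the part of the curve that the text emphasizes.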

Figure 8: ROC curves for LiRA (Carlini et al., 2022) on CIFAR-100 with R-50 backbone for two values of 𝜖 (2 and ∞) where 𝑆 varies and for 𝑆 = 50 where 𝜖 varies. TPR values in legends are measured at FPR=0.001. Complete results in Table 14 and Figs. 22 and 23. 𝛿 = 1/(100𝑆). Analysis: Middle: Non-private models are extremely vulnerable to MIAs. For 𝜖 = ∞, 𝑆 = 10, Head configuration, 82.2% of the examples can be successfully identified with FPR = 0.1%. Also, vulnerability decreases as 𝑆 increases. Right: Increasing the privacy level reduces the vulnerability of the model as expected; when 𝑆 = 10 with 𝜖 = 2 and FiLM, only 2.5% of the examples can be successfully identified with an FPR of 1%.

5 Federated Learning Experiments

Figure 9: Left: Private (colored) and non-private (gray) FL performance on FLAIR as a function of backbone and learnable parameters. 𝜖 = 2, 𝛿 = 41131^{-1.1}. We use Macro-AP as the primary metric to report accuracy for FLAIR. The R-18 All result on FLAIR is taken from Song et al. (2022). Our FLAIR results set a new state-of-the-art. Right: FLAIR communication cost, i.e. the number of parameters sent at every user-server communication round. Analysis: We set a new state-of-the-art result on the FLAIR benchmark by using FiLM on R-18, increasing Macro-AP from 44.3% to 51.9% and drastically reducing the communication cost from 11.2M parameters per round to 17K parameters. We further improve those results by using bigger backbones (R-50, VIT-B) with corresponding decreases in communication cost.

In this section, we investigate how imposing user-level DP influences the performance of large pretrained models fine-tuned via federated aggregation. In our evaluation, we use FLAIR (Song et al., 2022), a recently proposed real-world dataset for multi-label image classification. It has around 50k users (overall around 400k images) with heterogeneous data as well as a long-tailed label distribution, making it particularly appealing for benchmarking federated learning both in non-private and private settings. Comprising mainly natural image data, FLAIR is a low to medium TD dataset. As in Song et al. (2022), 𝛿 is set to 𝑁^{-1.1}, where 𝑁 = 41131 is the number of training clients, and 𝜖 = 2. We also perform experiments on CIFAR-100 and Federated EMNIST, which have far fewer training users but are widely used for benchmarking federated learning. Those results are in Section A.2.8.
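As a quick check of the scale this choice implies, the sketch below evaluates 𝛿 = 𝑁^{-1.1} for the FLAIR client count. The helper name is hypothetical; only 𝑁 = 41131 and the exponent 1.1 come from the text.

```python
def user_level_delta(num_clients: int, exponent: float = 1.1) -> float:
    """Return delta = N^(-exponent) for N training clients."""
    if num_clients <= 0:
        raise ValueError("num_clients must be positive")
    return float(num_clients) ** (-exponent)


flair_delta = user_level_delta(41131)
print(f"FLAIR delta = {flair_delta:.2e}")  # roughly 8.4e-06
print(f"1 / N       = {1 / 41131:.2e}")    # delta is smaller than 1 / N
```

Because the exponent is larger than 1, 𝛿 falls below 1/𝑁, which is the usual upper limit considered meaningful for user-level guarantees.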

As in the centralized experiments, we use R-50 and VIT-B, both pretrained on ImageNet-21K. We also perform experiments on a smaller architecture, ResNet18 (R-18) (He et al., 2016) pretrained on ImageNet-1K with 11.2M parameters, as it was initially used to achieve SOTA results on FLAIR.

For FL experiments, user-level DP is considered. We use FedADAM (Reddi et al., 2021) aggregation, which was shown to have better empirical performance than standard FedAvg (McMahan et al., 2017). We do not use Bayesian optimization for hyperparameter tuning, as each FL run is prohibitively expensive. Instead, we perform a small grid search over the server and client learning rates. Refer to Section A.3.6 for the hyperparameter ranges searched. For fair comparison on FLAIR, we fixed the other training hyperparameters to the values in the original paper (Song et al., 2022).

Fig. 9 (left) shows the performance of different model configurations on FLAIR with (color) and without (grey) DP. We report macro average precision (Macro-AP) results here, while additional metrics are shown in Tables 15 and 16. As communication cost is important in FL, in Fig. 9 (right) we report the number of parameters required to be transmitted for each model configuration in one user-server interaction. Summarizing Fig. 9, key observations are:

• With R-18 as used in the original paper, we achieve state-of-the-art performance under DP with FiLM, improving Macro-AP from 44.3% to 51.9%. This improvement comes with a reduction in communication cost from 11.2M parameters per user-server interaction to only 17k.

• With VIT-B we further improve the state-of-the-art result on FLAIR in both DP and non-private settings. Under DP, the Macro-AP increases to 59%, while for non-private, the Macro-AP increases from 62.1% to 74.7%.

• Head is more robust under DP than All or FiLM. Head has the smallest relative drop in performance of around 10% for any model configuration.

• Although FiLM outperforms Head and All on R-18, this is not always the case for other backbones. However, taking into account both test performance and communication cost, we can clearly see that either FiLM or Head is preferred. FiLM performs better for smaller backbones (R-18 and R-50), while Head is slightly better for VIT-B.

6 Discussion and Recommendations

Our work shows that DP few-shot learning works surprisingly well in the low TD setting, while the high TD setting is more difficult. Alternative strategies may include side-stepping privacy costs by leveraging the zero-shot capabilities of large pretrained models such as CLIP (Radford et al., 2021) or utilizing public data in addition to private data in the fine-tuning process as well (Golatkar et al., 2022) in order to improve utility. In summary, our experiments show that:

• How much additional data is required under few-shot DP? Image classification accuracy decreases as 𝜖 and 𝑆 decrease, and as TD increases. As a result, one should expect to use roughly 4–8× larger 𝑆 for 𝜖 = 8 and 20–35× larger 𝑆 for 𝜖 = 1 under DP to achieve accuracy comparable to non-private. (Note that 𝛿 = 1/|𝒟| for these multipliers.) The multipliers are surprisingly similar across different TD levels.

• Transfer learning dynamics under DP are fundamentally different from non-private. Non-private models with sufficient capacity operate in the interpolating regime and attain close to 100% training accuracy at all values of 𝑆, but have substantially lower test accuracy when 𝑆 is low. In contrast, models that are trained with DP-SGD are learning under heavy regularization and thus the training and test accuracies are significantly lower, but similar in value.

• Parameter-efficient FiLM adapters perform well under DP. FiLM is comparable to or better than All and Head in terms of accuracy, demonstrating its ability to adapt to differing downstream datasets despite fine-tuning fewer than 0.05% of the parameters in the backbone. When the TD is low, Head is competitive with FiLM and All, but when the TD is high, Head falls short as it cannot adapt the backbone to a downstream dataset that has a different data distribution. FiLM is also effective in the DP FL setting, achieving state-of-the-art accuracy on the FLAIR benchmark while reducing communication cost by orders of magnitude.

• Non-private Few-Shot Models Are Particularly Vulnerable to MIAs. The vulnerability of non-private few-shot models increases as 𝑆 decreases. DP significantly mitigates the effectiveness of MIAs, e.g., we found that DP few-shot models expose 2.5% of the examples at a 1% FPR when 𝜖 = 2 (on CIFAR-100 with 𝑆 = 10, FiLM on R-50), which is substantially less vulnerable than the non-private models.

Limitations. We identify the following limitations in this work: (1) We focused exclusively on few-shot transfer learning from relatively large pretrained models and did not consider meta-learning approaches or training from scratch. (2) We used FiLM adapters exclusively and did not consider other parameter-efficient adapters. Based on our experience, adapters do not have a large effect on the overall trends that we observed (in comparison to the items that we did vary), and making a fair comparison on a reasonable set of adapters would have exceeded our computational resources. (3) The Transfer Difficulty (TD) metric is not ideal as it depends on the network architecture and training hyperparameters, but in practice it aligns extremely well with the empirical difficulty of adapting to a downstream dataset. (4) We always set 𝛿

We compute the multiplier for a configuration and dataset at 𝜖 as follows: using the median accuracy obtained through the experiments depicted in Tables 2, 3, 4, 5, 6 and 7 (𝑆 = 1, 5, 10, 25, 50, 100, 250, 500) we linearly interpolate the median accuracy in the complete 𝑆 = [1, 500] grid. We determine the minimum 𝑆 required to reach at least the same accuracy as for non-private at 𝑆 ∈ {5, 10} using the 𝑆 = [1, 500] grid. The multiplier is then the minimum 𝑆 required for DP divided by the 𝑆 for non-private.
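The procedure above can be sketched in a few lines of plain Python. This is an illustrative reading of the description, assuming the interpolation grid consists of the integers from 1 to 500; the function names and the accuracy values used in the usage note are hypothetical and are not results from the paper.

```python
def interpolate_accuracy(shots, accs, s):
    """Piecewise-linear interpolation of accuracy at shot count s."""
    points = list(zip(shots, accs))
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        if s0 <= s <= s1:
            return a0 + (a1 - a0) * (s - s0) / (s1 - s0)
    raise ValueError("s lies outside the measured range of shots")


def shot_multiplier(shots, dp_accs, nonprivate_acc, nonprivate_s):
    """Smallest integer S at which the interpolated DP accuracy reaches the
    non-private accuracy, divided by the non-private S.

    Returns None if the DP accuracy never reaches the target on the grid.
    """
    for s in range(shots[0], shots[-1] + 1):
        if interpolate_accuracy(shots, dp_accs, s) >= nonprivate_acc:
            return s / nonprivate_s
    return None


MEASURED_SHOTS = [1, 5, 10, 25, 50, 100, 250, 500]
```

For example, with hypothetical median DP accuracies `[10, 12, 20, 40, 60, 75, 85, 90]` at the measured shots and a non-private accuracy of 70 at 𝑆 = 5, the interpolated DP curve first reaches 70 at 𝑆 = 84, giving a multiplier of 84 / 5 = 16.8.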

Figs. 10 and 11 display the same analysis as Fig. 2 for all backbones (VIT-B, R-50) and non-private shots of 𝑆 ∈ {5, 10}. Figs. 12 and 13 display the same analysis grouped by datasets.

Figure 10: Multiplier of shots required to reach same accuracy as non-private with 𝑆 = 5 for VIT-B and R-50 on CIFAR-10, CIFAR-100 and SVHN with 𝛿 = 1/|𝒟|. The data is obtained using linear interpolation of the median results of the experiments of Section A.2.1. The multiplier is 1 for all 𝜖 for ViT-B with All parameters on SVHN at 𝑆 = 5 (top left plot) because non-DP achieves a random accuracy and achieving random accuracy requires 𝑆 = 1 for all configurations in the experiment.

Figure 11: Multiplier of shots required to reach same accuracy as non-private with 𝑆 = 10 for VIT-B and R-50 on CIFAR-10, CIFAR-100 and SVHN with 𝛿 = 1/|𝒟|. The data is obtained using linear interpolation of the median results of the experiments of Section A.2.1.

Figure 12: Multiplier of shots required to reach same accuracy as non-private with 𝑆 ∈ {5, 10} for VIT-B on CIFAR-10, CIFAR-100 and SVHN with 𝛿 = 1/|𝒟|. The data is obtained using linear interpolation of the median results of the experiments of Section A.2.1. The multiplier is 1 for all 𝜖 for ViT-B with All parameters on SVHN at 𝑆 = 5 (top left plot) because non-DP achieves a random accuracy and achieving random accuracy requires 𝑆 = 1 for all configurations in the experiment.

Figure 13: Multiplier of shots required to reach same accuracy as non-private with 𝑆 ∈ {5, 10} for R-50 on CIFAR-10, CIFAR-100 and SVHN with 𝛿 = 1/|𝒟|. The data is obtained using linear interpolation of the median results of the experiments of Section A.2.1.

A.2.3 Additional versions of Fig. 4

Figure 14: Test and train classification accuracy as a function of shots and learnable parameters (All, FiLM and Head) on VIT-B for SVHN for different 𝜖 with 𝛿 = 1/|𝒟|. The accuracy is reported for the median runs of Table 7.

Figure 15: Test and train classification accuracy as a function of shots and learnable parameters (All, FiLM and Head) on VIT-B for CIFAR-100 for different 𝜖 with 𝛿 = 1/|𝒟|. The accuracy is reported for the median runs of Table 6.

A.2.4 Comparison of Backbones for Effect of Shots and 𝜖

Fig. 16 compares the backbones (VIT-B, R-50) using their best performing configuration. The VIT-B backbone achieves comparable or better performance.

Figure 16: Classification accuracy for different 𝜖 as a function of 𝑆 and backbone (VIT-B, R-50) for CIFAR-10, CIFAR-100 and SVHN. TD (low, medium, high) refers to the transfer difficulty and is computed as in Section A.1. The best performing configuration out of All, FiLM and Head for each combination of 𝜖, 𝑆 and backbone is used. The accuracy is reported over three seeds with the line showing the median and the band reporting the lowest and highest accuracy.

A.2.5 Advantage of FiLM as a Function of Shots

Figs. 17 and 18 show the difference between the mean classification accuracy of FiLM and Head. Darker red indicates FiLM is better. Darker blue indicates Head is better.

Figure 17: Heat map showing the accuracy advantage of FiLM over Head for CIFAR-10, CIFAR-100 and SVHN as a function of 𝜖. Backbone is VIT-B. Darker red indicates FiLM is better. Darker blue indicates Head is better. Datasets ordered from low to high TD.

Figure 18: Heat map showing the accuracy advantage of FiLM over Head for CIFAR-10, CIFAR-100 and SVHN as a function of 𝜖. Backbone is R-50. Darker red indicates FiLM is better. Darker blue indicates Head is better. Datasets ordered from low to high TD.

A.2.6 Additional VTAB-1k Results

| Dataset | Classes | 𝜖 = 1 | 𝜖 = 2 | 𝜖 = 4 | 𝜖 = 8 | 𝜖 = ∞ |
| --- | --- | --- | --- | --- | --- | --- |
| Caltech101 (Fei-Fei et al., 2006) | 102 | 16.1±5.2 | 34.9±2.1 | 55.3±1.0 | 69.9±2.7 | 93.7±0.4 |
| CIFAR100 (Krizhevsky, 2009) | 100 | 7.1±0.4 | 14.3±0.7 | 24.2±1.5 | 36.2±4.0 | 84.2±0.3 |
| Flowers102 (Nilsback & Zisserman, 2008) | 102 | 10.6±2.9 | 33±4.9 | 77.3±6.9 | 96±1.2 | 99.5±0.0 |
| Pets (Parkhi et al., 2012) | 37 | 26.7±6.0 | 56.9±7.0 | 76.0±3.7 | 84.2±0.7 | 91.7±0.2 |
| Sun397 (Xiao et al., 2010) | 397 | 2.4±2.1 | 5.7±1.5 | 7.7±0.4 | 11.6±3.1 | 55.9±0.2 |
| SVHN (Netzer et al., 2011) | 10 | 22.9±1.5 | 28.8±0.7 | 34±5.6 | 44.3±9.0 | 91.6±0.8 |
| DTD (Cimpoi et al., 2014) | 47 | 17.3±1.1 | 29.3±2.4 | 41.1±1.2 | 51.7±5.0 | 76.7±0.5 |
| EuroSAT (Helber et al., 2019) | 10 | 74.3±1.4 | 78.9±2.2 | 86±1.4 | 91.6±1.6 | 96.3±0.5 |
| Resics45 (Cheng et al., 2017) | 45 | 16±2.4 | 28±1.6 | 45.7±3.3 | 60.8±2.1 | 88.4±0.4 |
| Patch Camelyon (Veeling et al., 2018) | 2 | 74.1±1.5 | 76.6±1.4 | 78.9±2.1 | 76.2±5.3 | 87.1±0.7 |
| Retinopathy (Kaggle & EyePacs, 2015) | 5 | 73.4±0.5 | 73.1±0.5 | 73.6±0.1 | 73.6±0.1 | 74.0±1.3 |
| CLEVR-count (Johnson et al., 2017) | 8 | 21.5±5.6 | 28.8±1.5 | 33.6±2.4 | 38.2±0.7 | 57.6±8.7 |
| CLEVR-dist (Johnson et al., 2017) | 6 | 27.0±1.8 | 36.4±3.5 | 42.2±3.2 | 45.8±1.3 | 57.2±2.5 |
| dSprites-loc (Matthey et al., 2017) | 16 | 6.4±0.5 | 7.9±3.4 | 22.7±2.6 | 37.6±5.0 | 66.8±5.2 |
| dSprites-ori (Matthey et al., 2017) | 16 | 7.9±2.1 | 11.1±3.7 | 13.5±6.5 | 19.9±2.9 | 50.1±1.1 |
| SmallNORB-azi (LeCun et al., 2004) | 18 | 5.9±0.7 | 7.9±0.8 | 8.5±0.4 | 11.4±2.3 | 18.3±0.7 |
| SmallNORB-elev (LeCun et al., 2004) | 9 | 14.5±1.0 | 17.0±3.8 | 18.7±4.4 | 26.7±0.1 | 38.3±2.9 |
| DMLab (Beattie et al., 2016) | 6 | 29.2±1.3 | 32.7±1.4 | 35.4±2.0 | 39.3±1.1 | 51.5±1.9 |
| KITTI-dist (Geiger et al., 2013) | 4 | 41.7±5.8 | 51.2±3.4 | 57.9±8.6 | 68.9±0.3 | 76.0±0.7 |
| All | | 26.1 | 34.3 | 43.8 | 51.8 | 71.3 |
| Natural | | 14.7 | 29.0 | 45.1 | 56.3 | 84.8 |
| Specialized | | 59.4 | 64.2 | 71.0 | 73.6 | 86.4 |
| Structured | | 19.3 | 24.1 | 29.1 | 36.0 | 52.0 |

Figs. 19 and 20 depict the complete set of VTAB-1k accuracy results as a function of dataset, privacy level ( 𝜖 ), backbone, and learnable parameters. The datasets are ordered increasingly by transfer difficulty (TD). Although classifiers for the Retinopathy dataset appear to perform equally well independently of 𝜖 , a closer inspection reveals that this dataset is unbalanced and learned classifiers predict the most common class in all settings.

Figure 19: Classification accuracy for VTAB-1k datasets as a function of privacy level (𝜖) and learnable parameters. Backbone is R-50. Dashed lines in all plots indicate non-private accuracy as a reference. The datasets are ordered increasingly by transfer difficulty (TD).

Figure 20: Classification accuracy for VTAB-1k datasets as a function of privacy level (𝜖) and learnable parameters. Backbone is ViT-B. Dashed lines in all plots indicate non-private accuracy as a reference. The datasets are ordered increasingly by transfer difficulty (TD).

Fig. 21 depicts the final training and test accuracy as a function of 𝜖 and learnable parameters for all 19 VTAB-1k datasets with the ViT-B backbone. Although classifiers for the Retinopathy dataset appear to perform equally well independently of 𝜖 , a closer inspection reveals that this dataset is unbalanced and learned classifiers predict the most common class in all settings.

Figure 21: Test and train classification accuracy as a function of 𝜖 and learnable parameters (All, FiLM and Head) on VIT-B for all VTAB datasets with 𝛿 = 1/|𝒟|. The accuracy is reported for the median run of Tables 11, 12 and 13. The datasets are in order of increasing transfer difficulty (TD) from left-to-right and top-to-bottom.

A.2.7 Additional Membership Inference Attack Results

Fig. 22 depicts the complete set of ROC curves for LiRA on CIFAR-100 with the R-50 backbone for various privacy levels ( 𝜖 ) and learnable parameters Head and FiLM at a fixed 𝑆 .

Fig. 23 depicts the complete set of ROC curves for LiRA on CIFAR-100 with the R-50 backbone for various shots 𝑆 at fixed privacy levels ( 𝜖 ) and learnable parameters Head and FiLM.

Table 14 presents the True Positive Rates (TPR) at various False Positive Rates (FPR), Area Under Receiver Operating Curve (AUC), and Attack Advantage (Attack Adv) (Yeom et al., 2018) for various privacy levels ( 𝜖 ) and shots per class (S) corresponding to the plots in Figs. 22 and 23.

Figure 22: ROC curves for LiRA (Carlini et al., 2022) on CIFAR-100 with R-50 backbone for various privacy levels (𝜖) and backbone configurations Head and FiLM at a fixed 𝑆. TPR values in legends are measured at FPR=0.001.

Figure 23: ROC curves for LiRA (Carlini et al., 2022) on CIFAR-100 with R-50 backbone for various 𝑆 at fixed privacy levels (𝜖) and backbone configurations Head and FiLM. TPR values in legends are measured at FPR=0.001.

Table 14: True Positive Rates (TPR) at various False Positive Rates (FPR), Area Under Receiver Operating Curve (AUC), and Attack Advantage (Yeom et al., 2018) for various privacy levels (𝜖) and shots per class (𝑆) corresponding to the plots in Figs. 22 and 23. Dataset (𝒟) is CIFAR-100. Backbone is R-50 pretrained on ImageNet-21k.

| 𝜖 | 𝑆 | TPR (%) @ 0.1% FPR, Head | TPR (%) @ 0.1% FPR, FiLM | TPR (%) @ 1% FPR, Head | TPR (%) @ 1% FPR, FiLM | TPR (%) @ 10% FPR, Head | TPR (%) @ 10% FPR, FiLM | AUC, Head | AUC, FiLM | Attack Adv, Head | Attack Adv, FiLM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 10 | 0.20 | 0.22 | 1.85 | 1.88 | 14.71 | 15.10 | 0.564 | 0.572 | 0.092 | 0.106 |
| 1 | 25 | 0.17 | 0.17 | 1.52 | 1.61 | 13.51 | 13.77 | 0.550 | 0.552 | 0.070 | 0.074 |
| 1 | 50 | 0.16 | 0.16 | 1.50 | 1.50 | 13.04 | 12.99 | 0.541 | 0.541 | 0.058 | 0.058 |
| 1 | 100 | 0.16 | 0.15 | 1.43 | 1.41 | 12.54 | 12.26 | 0.535 | 0.531 | 0.049 | 0.042 |
| 2 | 10 | 0.36 | 0.30 | 2.62 | 2.56 | 18.76 | 18.83 | 0.610 | 0.613 | 0.164 | 0.166 |
| 2 | 25 | 0.27 | 0.27 | 2.19 | 2.12 | 17.12 | 16.83 | 0.593 | 0.593 | 0.138 | 0.138 |
| 2 | 50 | 0.25 | 0.23 | 2.05 | 1.96 | 15.95 | 15.46 | 0.579 | 0.573 | 0.115 | 0.106 |
| 2 | 100 | 0.23 | 0.21 | 1.89 | 1.81 | 15.11 | 14.47 | 0.566 | 0.557 | 0.094 | 0.080 |
| 4 | 10 | 0.65 | 0.56 | 4.20 | 4.14 | 26.06 | 26.19 | 0.678 | 0.677 | 0.262 | 0.260 |
| 4 | 25 | 0.33 | 0.49 | 3.00 | 3.53 | 21.07 | 22.77 | 0.637 | 0.646 | 0.203 | 0.214 |
| 4 | 50 | 0.42 | 0.41 | 3.21 | 3.02 | 21.27 | 20.28 | 0.629 | 0.617 | 0.187 | 0.167 |
| 4 | 100 | 0.39 | 0.36 | 2.90 | 2.68 | 19.33 | 18.30 | 0.606 | 0.595 | 0.148 | 0.131 |
| 8 | 10 | 1.01 | 1.47 | 7.06 | 8.23 | 36.20 | 37.58 | 0.748 | 0.753 | 0.370 | 0.378 |
| 8 | 25 | 1.20 | 1.14 | 6.95 | 6.47 | 33.49 | 31.66 | 0.717 | 0.702 | 0.316 | 0.294 |
| 8 | 50 | 1.00 | 0.88 | 5.81 | 5.33 | 29.40 | 27.31 | 0.688 | 0.667 | 0.267 | 0.234 |
| 8 | 100 | 0.78 | 0.76 | 5.00 | 4.62 | 26.01 | 23.98 | 0.660 | 0.636 | 0.221 | 0.180 |
| ∞ | 10 | 82.22 | 52.50 | 90.37 | 78.78 | 97.17 | 93.35 | 0.992 | 0.981 | 0.905 | 0.846 |
| ∞ | 25 | 53.92 | 44.88 | 67.53 | 57.58 | 84.76 | 76.60 | 0.959 | 0.930 | 0.748 | 0.666 |
| ∞ | 50 | 41.00 | 22.72 | 52.96 | 38.84 | 71.46 | 58.63 | 0.913 | 0.854 | 0.616 | 0.491 |
| ∞ | 100 | 24.09 | 7.72 | 37.19 | 20.15 | 56.44 | 44.80 | 0.845 | 0.777 | 0.472 | 0.362 |

A.2.8 Additional Federated Learning Results

Table 15 shows the non-private performance on the FLAIR dataset, while Table 16 shows the same performance under DP guarantees with 𝜖 = 2.

Table 15: Non-private Federated Learning performance on FLAIR as a function of backbone 𝑏_𝜃 and learnable parameters. C stands for averaged per-class metrics (Macro) and O denotes overall metrics (Micro). P, R and AP denote precision, recall, and average precision, respectively. The R-18 All result is taken from the original paper Song et al. (2022). Due to the significant computational requirements, only a single random seed was used in all experiments on FLAIR.

| 𝑏_𝜃 | Parameters | 𝜖 | C-P | O-P | C-R | O-R | C-F1 | O-F1 | C-AP | O-AP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-18 | All | ∞ | 71.8 | 83.5 | 48.6 | 76.0 | 58.0 | 79.5 | 62.1 | 88.8 |
| R-18 | FiLM | ∞ | 73.8 | 82.0 | 44.8 | 74.4 | 55.7 | 78.0 | 59.7 | 87.7 |
| R-18 | Head | ∞ | 71.0 | 79.9 | 43.8 | 72.9 | 54.1 | 76.2 | 57.9 | 85.8 |
| R-50 | All | ∞ | 76.9 | 85.2 | 62.0 | 82 | 68.6 | 83.6 | 72.3 | 91.9 |
| R-50 | FiLM | ∞ | 78.3 | 83.8 | 57.9 | 80.0 | 66.6 | 81.9 | 70.2 | 90.6 |
| R-50 | Head | ∞ | 76 | 82.3 | 42.7 | 71.3 | 54.6 | 76.4 | 60.5 | 86.7 |
| VIT-B | All | ∞ | 79.6 | 86.8 | 57.4 | 82.9 | 66.7 | 84.8 | 72.9 | 93.1 |
| VIT-B | FiLM | ∞ | 81.9 | 86.8 | 59.3 | 81.6 | 68.8 | 84.1 | 74.7 | 92.7 |
| VIT-B | Head | ∞ | 81.6 | 83.7 | 52 | 72.2 | 63.4 | 77.5 | 70.0 | 87.6 |

Table 16: Federated Learning performance on FLAIR under DP with 𝜖 = 2 as a function of backbone 𝑏_𝜃 and learnable parameters. C stands for averaged per-class metrics (Macro) and O denotes overall metrics (Micro). P, R and AP denote precision, recall, and average precision, respectively. The R-18 All result is taken from the original paper Song et al. (2022). Due to significant computational requirements, only a single random seed was used in all experiments with FLAIR.

| 𝑏_𝜃 | Parameters | 𝜖 | C-P | O-P | C-R | O-R | C-F1 | O-F1 | C-AP | O-AP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-18 | All | 2 | 47.3 | 77.5 | 32.3 | 64.3 | 38.4 | 70.3 | 44.3 | 80.2 |
| R-18 | FiLM | 2 | 59.0 | 81.0 | 39.1 | 70.3 | 47.0 | 75.3 | 51.9 | 85.2 |
| R-18 | Head | 2 | 47.6 | 81.4 | 34.2 | 66.4 | 39.8 | 73.1 | 47.2 | 83.4 |
| R-50 | All | 2 | 56.2 | 83.1 | 38.1 | 70.9 | 45.4 | 76.6 | 52.3 | 86.6 |
| R-50 | FiLM | 2 | 59.7 | 79.3 | 39.4 | 69.9 | 47.5 | 74.3 | 51.3 | 84.2 |
| R-50 | Head | 2 | 57.0 | 79.8 | 38.0 | 68.5 | 45.6 | 73.7 | 50.4 | 83.8 |
| VIT-B | All | 2 | 47.8 | 82.3 | 37.5 | 71.0 | 42.1 | 76.2 | 49.7 | 86.1 |
| VIT-B | FiLM | 2 | 58.1 | 84.2 | 42.5 | 76 | 49.1 | 79.9 | 57.2 | 89.2 |
| VIT-B | Head | 2 | 67.1 | 83.4 | 39.8 | 68.9 | 50.0 | 75.5 | 59.0 | 85.9 |

CIFAR-100 and Federated EMNIST

Additionally, we perform experiments on CIFAR-100 and Federated EMNIST, which are commonly used to benchmark FL methods. We opt for these datasets as they have different degrees of TD: CIFAR-100 has medium TD, while Federated EMNIST has high TD. For CIFAR-100, we use 500 training clients and 100 test clients, with each client having 100 samples and no clients sharing any data. To introduce more client heterogeneity, the data are distributed using the Pachinko Allocation Method (Li & McCallum, 2006) as in Reddi et al. (2021). Federated EMNIST (Caldas et al., 2018) is a dataset of black-and-white handwritten symbols from 62 classes grouped according to the writer. EMNIST is a highly out-of-distribution dataset (i.e. high TD) with respect to the ImageNet-21K pretraining data. As the number of training users in CIFAR-100 (500 users) and Federated EMNIST (3400 users) is relatively low, we need to increase 𝜖 from 2 to 8, such that the amount of added noise during aggregation is not excessive. 𝛿 is set to 𝑁^{-1.1}, where 𝑁 is the number of training clients. For CIFAR-100 and Federated EMNIST, we report standard test classification accuracy. All training details and hyperparameters are in Section A.3.6.

Fig. 24 shows the performance of different model configurations on CIFAR-100 and Federated EMNIST with and without DP. Table 17 reports private (𝜖 = 8) and non-private performance on CIFAR-100 and Federated EMNIST, giving a tabular version of the results in Fig. 24.

Figure 24: Private (𝜖 = 8, colored) and non-private (𝜖 = ∞, gray) FL performance on CIFAR-100 (left) and Federated EMNIST (right) as a function of backbone and learnable parameters. We report accuracy on test clients. R-18 backbone is pretrained on ImageNet-1k, VIT-B and R-50 are pretrained on ImageNet-21k.

Table 17: Federated Learning performance on CIFAR-100 and EMNIST with (𝜖 = 8) and without (𝜖 = ∞) DP as a function of backbone 𝑏_𝜃 and learnable parameters. Accuracy (in %) is reported. R-18 backbone is pretrained on ImageNet-1k, VIT-B and R-50 are pretrained on ImageNet-21k. The ± sign indicates the 95% confidence interval over 3 runs with different seeds.

| Dataset | 𝜖 | R-18 Head | R-18 FiLM | R-18 All | R-50 Head | R-50 FiLM | R-50 All | VIT-B Head | VIT-B FiLM | VIT-B All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-100 | ∞ | 63.3±0.2 | 69.8±0.3 | 72.8±0.7 | 59.1±0.5 | 79.8±0.5 | 83.0±0.1 | 84.6±0.1 | 90.2±0.3 | 90.8±0.3 |
| CIFAR-100 | 8 | 27.1±1.4 | 18.3±0.9 | 15.6±1.0 | 20.9±0.6 | 21.3±1.0 | 23.5±1.3 | 50.8±0.1 | 40.2±2.3 | 41.2±3.4 |
| EMNIST | ∞ | 65.4±0.1 | 74.0±0.9 | 78.4±1.1 | 66.2±0.4 | 75.9±0.4 | 79.9±0.4 | 72.7±0.2 | 78.6±0.1 | 80.5±0.1 |
| EMNIST | 8 | 58.0±0.4 | 66.3±0.5 | 65.5±0.2 | 57.0±0.3 | 63.5±0.1 | 65.8±0.3 | 62.6±0.1 | 68.4±0.2 | 69.7±0.3 |

A.3 Training and Evaluation Details

A.3.1 FiLM Layer Implementation

Table 18 details the locations and count of the parameters that are updateable for the FiLM configuration in each of the backbones used in the experiments.

Table 18: Backbone parameter count, FiLM parameter count, FiLM parameter count as a percentage of the backbone parameter count, and FiLM parameter locations within the backbone for each of the backbones used in the experiments.

| Backbone | Backbone Count | FiLM Count | FiLM (%) | Locations |
| --- | --- | --- | --- | --- |
| R-18 | 11.2M | 7808 | 0.07 | GroupNorm Scale and Bias that follows each 3x3 Conv layer |
| R-50 | 23.5M | 11648 | 0.05 | GroupNorm Scale and Bias that follows each 3x3 Conv layer; Final GroupNorm Scale and Bias before Head |
| VIT-B | 85.8M | 38400 | 0.04 | All LayerNorm Scale and Bias |
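The FiLM counts for two of the backbones can be reproduced from the listed locations with a short calculation. The sketch below assumes the standard ViT-B layout (12 transformer blocks of width 768, two LayerNorms per block plus one final LayerNorm) and the standard ResNet-50 bottleneck layout (3, 4, 6 and 3 blocks whose 3x3 convolutions have 64, 128, 256 and 512 channels, with 2048 channels before the head). These architectural constants are assumptions that are not stated in the text, and the function names are illustrative.

```python
def vit_b_film_params(num_blocks=12, width=768):
    """Scale and bias of every LayerNorm in a ViT-B style encoder."""
    num_layernorms = 2 * num_blocks + 1  # two per block plus the final norm
    return num_layernorms * 2 * width    # one scale and one bias per channel


def r50_film_params(blocks_per_stage=(3, 4, 6, 3),
                    conv3x3_channels=(64, 128, 256, 512),
                    final_channels=2048):
    """Scale and bias of the norm after each 3x3 conv plus the final norm."""
    after_3x3 = sum(n * c for n, c in zip(blocks_per_stage, conv3x3_channels))
    return 2 * after_3x3 + 2 * final_channels


def percentage(film_count, backbone_count):
    """FiLM parameters as a percentage of backbone parameters."""
    return 100.0 * film_count / backbone_count


print(vit_b_film_params(), round(percentage(vit_b_film_params(), 85.8e6), 2))
print(r50_film_params(), round(percentage(r50_film_params(), 23.5e6), 2))
```

Under these assumptions the counts agree with the VIT-B and R-50 rows of Table 18, which illustrates why FiLM fine-tuning touches only a few hundredths of a percent of the backbone.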

A.3.2 Hyperparameter Tuning

For all centralized experiments, we first draw 𝒟 of the required size (|𝒟| = 𝐶𝑆, or |𝒟| = 1000 in the case of VTAB-1k) from the entire training split of the current dataset under evaluation. For the purposes of hyperparameter tuning, we then split 𝒟 into 70% train and 30% validation. We then perform 20 iterations of hyperparameter tuning using the Tree-structured Parzen Estimator (Bergstra et al., 2011) strategy as implemented in Optuna (Akiba et al., 2019) to derive a set of hyperparameters that yields the highest accuracy on the validation split. This set of hyperparameters is subsequently used to train a final model on all of 𝒟. We then evaluate the final, tuned model on the entire test split of the current dataset. Details on the set of hyperparameters that are tuned and their ranges can be found in Table 19. For DP training, we compute the required noise multiplier depending on the target (𝜖, 𝛿)-DP guarantee. The hyperparameter ranges are purposely broad and have been empirically derived. We fine-tune models for at most 200 epochs to limit the amount of compute necessary.
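The tuning loop can be illustrated with a dependency-free sketch. Note the substitution: plain random search stands in here for the Tree-structured Parzen Estimator used in the paper, and the log-uniform sampling of the learning rate is an assumption. The ranges follow Table 19; `validation_accuracy` is a hypothetical callback that would train on the 70% split and score on the 30% split.

```python
import random


def split_train_val(examples, train_frac=0.7, seed=0):
    """Shuffle the drawn dataset D and split it into train and validation."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def sample_hyperparameters(rng, dataset_size):
    """Draw one configuration from the ranges listed in Table 19."""
    return {
        "epochs": rng.randint(1, 200),
        "learning_rate": 10 ** rng.uniform(-7, -2),  # log-uniform: an assumption
        "batch_size": rng.randint(10, max(10, dataset_size)),
        "clipping_norm": rng.uniform(0.2, 10.0),
    }


def tune(validation_accuracy, dataset_size, n_trials=20, seed=0):
    """Random-search stand-in for TPE: keep the best of n_trials draws."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        hp = sample_hyperparameters(rng, dataset_size)
        acc = validation_accuracy(hp)
        if best is None or acc > best[0]:
            best = (acc, hp)
    return best
```

The noise multiplier is deliberately absent from the search space: as stated above, it is derived from the target (𝜖, 𝛿) guarantee together with the sampled batch size and number of epochs rather than tuned freely.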

Table 19: Hyperparameter ranges used for the Bayesian optimization.

| Hyperparameter | Lower bound | Upper bound |
| --- | --- | --- |
| epochs | 1 | 200 |
| learning rate | 1e-7 | 1e-2 |
| batch size | 10 | \|𝒟\| |
| clipping norm | 0.2 | 10 |
| noise multiplier | Based on target 𝜖 | |

A.3.3 Effect of Shots per Class and 𝜖 Experiments

For each evaluated configuration, we draw |𝒟| = 𝐶𝑆 examples from the dataset training split, tune hyperparameters as described in Section A.3.2, and then test on the entire test split of the dataset. We use the DP-Adam optimizer as implemented in Opacus (Yousefpour et al., 2021) for all private experiments. For non-private experiments, we used the Adam (Kingma & Ba, 2015) optimizer for the Head and FiLM parameter configurations and the SGD optimizer for the All configuration. No data augmentation was used and images were scaled to 224 × 224 pixels.

All of the experiments on the effect of 𝑆 and 𝜖 were carried out on 1 (for Head and FiLM) and up to 3 (for All) NVIDIA V100 GPUs with 32GB of memory. The runtime for executing the whole experiment depends on the size of the few-shot training set and the number of parameters resulting from the choice of the backbone and the number of learnable parameters (All > FiLM > Head). For CIFAR-10 and SVHN the runtime for one configuration ranges from less than 5 GPU minutes (𝑆 = 1 with Head) to 60 GPU hours (𝑆 = 500 with All). For CIFAR-100, the range is from 15 GPU minutes (𝑆 = 1 with Head) to over 700 GPU hours (𝑆 = 500 with All).

A.3.4 VTAB-1k Experiments

For each evaluated configuration of each of the 19 datasets in the VTAB-1k benchmark, we draw |𝒟| = 1000 examples from the dataset training split, tune hyperparameters as described in Section A.3.2, and then test on the entire test split of the dataset. We use the DP-Adam optimizer as implemented in Opacus (Yousefpour et al., 2021) for all private experiments. For non-private experiments, we used the Adam (Kingma & Ba, 2015) optimizer for the Head and FiLM parameter configurations and the SGD optimizer for the All configuration.

No data augmentation was used. For the R-50 backbone, images were scaled to 384 × 384 pixels unless the image size was 32 × 32 pixels or less, in which case the images were scaled to 224 × 224 pixels. For the VIT-B backbone, images were scaled to 224 × 224 pixels.

All of the VTAB-1k transfer learning experiments were carried out on a single NVIDIA A100 GPU with 80GB of memory. Processing times for each configuration of each dataset will vary with the selected hyperparameters and the size of the test split, but approximate times are listed in Table 20.

Table 20: Approximate time to tune, train, and test a single configuration of parameters on a single VTAB-1k dataset for various backbones and parameter configurations. Units are wall clock GPU hours.

| Backbone | None | FiLM | All |
| --- | --- | --- | --- |
| R-50 | 0.6 | 0.9 | 2.7 |
| VIT-B | 1.3 | 2.4 | 6.5 |

A.3.5 Membership Inference Attacks Experiments

For each setting of 𝑆 and 𝜖, we first sample 2|𝒟| examples (recall |𝒟| = 𝐶𝑆 = 100𝑆) from the CIFAR-100 training set, and then train 257 different models (1 target model plus 256 shadow models) where each sample for the training set is randomly selected with 50% probability from the 2|𝒟| examples. This ensures that approximately half of the models are trained on each example and half are not, so that we can create distributions over the losses for each example being in and out of the training set as described in the LiRA algorithm (Carlini et al., 2022). We use each of the trained models in turn as the target model and then accumulate the attack predictions over all 257 targets to produce the ROC curve for the attack. Due to the extreme computational demand of training a large number of shadow models for each setting of 𝑆 and 𝜖, we restrict the attacks to the R-50 backbone and the Head and FiLM parameter configurations.
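The sampling scheme and the per-example likelihood ratio test can be sketched as follows. This is a simplified illustration, not the TensorFlow Privacy implementation used in the paper: it assumes one scalar statistic per (model, example) pair, such as a loss or logit-scaled confidence, and fits a Gaussian to the "in" and "out" statistics of the shadow models. All function names are illustrative.

```python
import math
import random
import statistics


def sample_membership(num_models, num_examples, seed=0):
    """membership[m][i] is True if example i is in the training set of model m.

    Each example is included independently with probability 0.5, so roughly
    half of the models are trained on any given example.
    """
    rng = random.Random(seed)
    return [[rng.random() < 0.5 for _ in range(num_examples)]
            for _ in range(num_models)]


def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution with the given mean and std."""
    return (-0.5 * math.log(2.0 * math.pi * std * std)
            - (x - mean) ** 2 / (2.0 * std * std))


def lira_scores(target, membership, stats, min_std=1e-6):
    """Per-example log likelihood ratio log p(stat | in) - log p(stat | out).

    stats[m][i] is the statistic of model m on example i. All models other
    than the target act as shadow models.
    """
    shadows = [m for m in range(len(membership)) if m != target]
    scores = []
    for i, observed in enumerate(stats[target]):
        ins = [stats[m][i] for m in shadows if membership[m][i]]
        outs = [stats[m][i] for m in shadows if not membership[m][i]]
        if not ins or not outs:
            raise ValueError(f"example {i} lacks an in or out shadow model")
        mu_in, sd_in = statistics.fmean(ins), max(statistics.pstdev(ins), min_std)
        mu_out, sd_out = statistics.fmean(outs), max(statistics.pstdev(outs), min_std)
        scores.append(gaussian_logpdf(observed, mu_in, sd_in)
                      - gaussian_logpdf(observed, mu_out, sd_out))
    return scores
```

Calling `lira_scores` once per model with `target` ranging over all 257 indices, and pooling the resulting scores with the corresponding membership labels, mirrors the accumulation over targets described above.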

Our implementation is based on code from the TensorFlow Privacy library (Google, 2019b). All of the VTAB-1k transfer learning experiments were carried out on a single NVIDIA A100 GPU with 80GB of memory. When training the 257 models for each attack configuration, we do not perform hyperparameter tuning; instead, we use the hyperparameter set from the CIFAR-100 experiments in Table 3 that yielded the highest accuracy for the particular configuration. Approximate training times for all 257 models in each configuration are listed in Table 21. The value of 𝜖 did not alter the training times to a significant degree.

Table 21: Approximate time to train 257 models for a single configuration of parameters for a LiRA attack on the CIFAR-100 dataset for various parameter and shot configurations. Units are wall clock GPU hours.

| Parameter Configuration | 𝑆 = 10 | 𝑆 = 25 | 𝑆 = 50 | 𝑆 = 100 |
| --- | --- | --- | --- | --- |
| Head | 6 | 12 | 16 | 46 |
| FiLM | 8 | 25 | 49 | 96 |

A.3.6 Federated Learning Experiments

All experiments were performed in TensorFlow using tensorflow-federated (Google, 2019a) for federated aggregation and tensorflow-privacy (Google, 2019b) for privacy accounting and the adaptive clipping algorithm (Andrew et al., 2021). CIFAR-100 and Federated EMNIST datasets were taken from tensorflow-federated.

FLAIR

Each model configuration is trained for 5000 rounds with a cohort size of 200. Each sampled user trains the model locally with SGD for 2 epochs with local batch size set to 16. The maximum number of images for each user is set to 512. For DP, 𝜖 = 2, 𝛿 = 𝑁^{-1.1}, where 𝑁 is the number of training users. As in the original paper, we set L2 norm quantile to 0.1 for adaptive clipping and we use 200 users sampled uniformly per round to simulate the noise-level with a cohort size of 5000.

For the non-private setting we perform the grid search over:

• server learning rate ∈ {0.01, 0.05, 0.1}

• client learning rate ∈ {0.01, 0.05, 0.1}

For the private setting (𝜖 = 2) we fixed the client learning rate to the optimal value found for the non-private run and perform a grid search over the server learning rate in the set {𝑎/2, 𝑎/10, 𝑎/50, 𝑎/100}, where 𝑎 is the optimal server learning rate found for the non-private setting.
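The two-stage search for FLAIR can be enumerated explicitly. The sketch below only builds the candidate configurations described above; the constant and function names are illustrative, and the best non-private learning rates passed in the usage note are hypothetical values.

```python
from itertools import product

NONPRIVATE_SERVER_LRS = (0.01, 0.05, 0.1)
NONPRIVATE_CLIENT_LRS = (0.01, 0.05, 0.1)
PRIVATE_SERVER_DIVISORS = (2, 10, 50, 100)


def nonprivate_grid():
    """All server and client learning rate pairs tried without DP."""
    return [{"server_lr": s, "client_lr": c}
            for s, c in product(NONPRIVATE_SERVER_LRS, NONPRIVATE_CLIENT_LRS)]


def private_grid(best_server_lr, best_client_lr):
    """Under DP the client rate is fixed and the server rate is scaled down."""
    return [{"server_lr": best_server_lr / d, "client_lr": best_client_lr}
            for d in PRIVATE_SERVER_DIVISORS]
```

For instance, `private_grid(0.1, 0.05)` yields four candidates with server learning rates 0.05, 0.01, 0.002 and 0.001, so each backbone and parameter configuration needs 9 non-private runs followed by 4 private runs.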

Processing times for each configuration on FLAIR are given in Table 22.

Table 22:Approximate time to train and test a single configuration of parameters on FLAIR dataset for various backbones and parameter configurations. Units are wall clock GPU hours. Parameter Configuration Backbone Head FiLM All R-18 18 30 - R-50 30 43 60 VIT-B 40 60 75 CIFAR-100 and Federated EMNIST

Each model configuration is trained for 500 rounds with a cohort size of 20. Each sampled user trains the model locally with SGD for 5 epochs with local batch size set to 100. The maximum number of images for each user is set to 512. For DP, $\epsilon = 8$, $\delta = N^{-1.1}$, where $N$ is the number of training users ($N = 500$ for CIFAR-100, $N = 3400$ for Federated EMNIST). As in the original paper, we set L2 norm quantile to 0.1 for adaptive clipping and we use 20 users sampled uniformly per round to simulate the noise-level with a cohort size of 100.
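As a quick check of what $\delta = N^{-1.1}$ means in practice, the following sketch evaluates it for the two user counts given above. The helper name `dp_delta` is an assumption for illustration.

```python
def dp_delta(num_users, exponent=1.1):
    """Return delta = N^(-exponent) for N training users."""
    if num_users <= 1:
        raise ValueError("need more than one user")
    return num_users ** (-exponent)

# User counts as stated in the text.
user_counts = {"CIFAR-100": 500, "Federated EMNIST": 3400}
deltas = {name: dp_delta(n) for name, n in user_counts.items()}
```

This gives $\delta \approx 1.07 \times 10^{-3}$ for CIFAR-100 and $\delta \approx 1.30 \times 10^{-4}$ for Federated EMNIST. In both cases $\delta$ is slightly smaller than $1/N$, because the exponent is below $-1$.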

For the non-private setting we perform the grid search over:

• server learning rate $\in \{0.05, 0.1, 0.5\}$

• client learning rate $\in \{0.01, 0.05, 0.1\}$

For the private setting ($\epsilon = 8$) we fixed the client learning rate to the optimal value found for the non-private run and perform a grid search over:

• server learning rate $\in \{a/2, a/10, a/50, a/100\}$, where $a$ is the optimal server learning rate found for the non-private setting.

• quantile for adaptive clipping bound $\in \{0.1, 0.5, 0.8\}$
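The private-setting grid described above can be enumerated as follows. This is a minimal sketch: the function name, the dictionary keys, and the example value of the best non-private server learning rate are assumptions for illustration.

```python
from itertools import product

def private_grid(best_server_lr, quantiles=(0.1, 0.5, 0.8), divisors=(2, 10, 50, 100)):
    """Enumerate (server learning rate, clipping quantile) pairs for the DP runs.

    `best_server_lr` plays the role of `a`, the optimal server learning rate
    found in the non-private grid search.
    """
    server_lrs = [best_server_lr / d for d in divisors]
    return [
        {"server_lr": lr, "clip_quantile": q}
        for lr, q in product(server_lrs, quantiles)
    ]

# Hypothetical example: suppose the non-private search selected a = 0.1.
grid = private_grid(best_server_lr=0.1)
```

With four server learning rates and three quantiles, each backbone and parameter configuration needs 12 private runs, compared with 9 for the non-private grid.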

A.3.7 On the $(\epsilon, \delta)$-DP accounting

In the centralized experiments we compute the $(\epsilon, \delta)$-DP guarantees using the RDP accountant (Mironov, 2017) with $\delta = 1/|\mathcal{D}|$, where $|\mathcal{D}| = CS$ (i.e. the number of classes $C$ multiplied by shot $S$). Setting $\delta = 1/|\mathcal{D}|$ is a standard choice and simplifies comparisons with other papers. To allow for an easier comparison among different $|\mathcal{D}|$, we provide Table 23, which illustrates the change of $\epsilon$ computed using the RDP accountant for $\delta = 10^{-5}$.
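To see why the reported $\epsilon$ changes with $\delta$, consider the standard conversion from Rényi DP to $(\epsilon, \delta)$-DP (Mironov, 2017), $\epsilon = \min_{\alpha > 1} \left[ \epsilon_{\mathrm{RDP}}(\alpha) + \log(1/\delta)/(\alpha - 1) \right]$. The sketch below is a simplified illustration, assuming an unsubsampled Gaussian mechanism with sensitivity 1 composed over a fixed number of steps. It ignores subsampling amplification, so it is not the accountant provided in opacus, and the noise multiplier and step count are hypothetical values.

```python
import math

def gaussian_rdp(alpha, sigma, steps):
    """RDP of order alpha for `steps` compositions of a Gaussian mechanism
    with sensitivity 1 and noise standard deviation sigma."""
    return steps * alpha / (2.0 * sigma ** 2)

def rdp_to_epsilon(sigma, steps, delta, orders=None):
    """Convert the RDP curve to an (epsilon, delta)-DP guarantee by
    minimising over a grid of orders alpha > 1."""
    if orders is None:
        orders = [1.0 + k / 10.0 for k in range(1, 1000)]
    return min(
        gaussian_rdp(a, sigma, steps) + math.log(1.0 / delta) / (a - 1.0)
        for a in orders
    )

# Same mechanism, two choices of delta (hypothetical sigma and step count).
eps_at_1e3 = rdp_to_epsilon(sigma=4.0, steps=10, delta=1e-3)  # e.g. delta = 1/|D| with |D| = 1000
eps_at_1e5 = rdp_to_epsilon(sigma=4.0, steps=10, delta=1e-5)
```

For a fixed mechanism, a smaller $\delta$ yields a larger $\epsilon$, which matches the direction of the change in Table 23 whenever $1/|\mathcal{D}| > 10^{-5}$.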

Additionally, we recompute the $(\epsilon, \delta)$-DP guarantees with the PRV accountant (Gopi et al., 2021), which is an accurate numerical accountant and results in slightly smaller $\epsilon$ than the RDP accountant given the same privacy parameters and $\delta$. Table 24 shows the results.

Table 23: Recomputed $\epsilon$ at $\delta = 10^{-5}$ as a function of $S$ for the datasets CIFAR-10, CIFAR-100 and SVHN and original $\epsilon \in \{1, 2, 4, 8\}$ that was computed originally at $\delta = 1/|\mathcal{D}|$. The computation is done using the RDP accountant (Mironov, 2017) provided in opacus (Yousefpour et al., 2021). The ranges of $\epsilon$ result from the fact that there is not a direct mapping from the original $\epsilon$ to the recomputed $\epsilon$ but the recomputed $\epsilon$ depends on the used privacy parameters (noise multiplier, subsampling ratio and number of steps).

| Dataset | Original $\epsilon$ | $S=1$ | $S=5$ | $S=10$ | $S=25$ | $S=50$ | $S=100$ | $S=250$ | $S=500$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | 1 | 3.30-3.32 | 2.20-2.33 | 1.94-2.20 | 1.69-1.71 | 1.54-1.56 | 1.43-1.46 | 1.30-1.34 | 1.22-1.24 |
| CIFAR-10 | 2 | 5.41-5.43 | 3.95-4.49 | 3.56-3.60 | 3.18-3.22 | 2.95-2.97 | 2.76-2.92 | 2.54-2.66 | 2.41-2.50 |
| CIFAR-10 | 4 | 9.14-9.16 | 7.14-8.25 | 6.57-6.78 | 5.99-6.11 | 5.61-5.73 | 5.31-5.68 | 4.96-5.31 | 4.73-4.87 |
| CIFAR-10 | 8 | 15.80-15.82 | 13.02-13.84 | 12.19-13.41 | 11.35-11.71 | 10.72-11.57 | 10.25-10.83 | 9.67-10.50 | 9.28-9.51 |
| CIFAR-100 | 1 | 1.94-1.97 | 1.54-1.65 | 1.43-1.44 | 1.30-1.34 | 1.22-1.23 | 1.16-1.17 | 1.08-1.09 | 1.03-1.04 |
| CIFAR-100 | 2 | 3.56-3.64 | 2.95-2.98 | 2.75-2.77 | 2.54-2.55 | 2.41-2.42 | 2.29-2.30 | 2.16-2.17 | 2.07-2.08 |
| CIFAR-100 | 4 | 6.60-6.81 | 5.63-5.73 | 5.32-5.42 | 4.96-5.03 | 4.73-4.94 | 4.53-4.55 | 4.29-4.30 | 4.14-4.15 |
| CIFAR-100 | 8 | 12.26-12.74 | 10.76-10.86 | 10.25-10.48 | 9.66-9.85 | 9.28-9.39 | 8.94-9.03 | 8.52-8.55 | 8.25-8.29 |
| SVHN | 1 | 3.30-3.32 | 2.19-2.33 | 1.94-1.96 | 1.69-1.71 | 1.54-1.56 | 1.43-1.50 | 1.30-1.31 | 1.22-1.24 |
| SVHN | 2 | 5.41-5.43 | 3.95-4.03 | 3.56-3.62 | 3.17-3.23 | 2.94-2.97 | 2.75-2.91 | 2.54-2.56 | 2.41-2.43 |
| SVHN | 4 | 9.14-9.16 | 7.14-7.44 | 6.60-7.52 | 5.99-6.50 | 5.62-5.93 | 5.32-5.68 | 4.96-5.01 | 4.73-4.77 |
| SVHN | 8 | 15.80-15.82 | 13.03-13.73 | 12.19-12.72 | 11.34-11.85 | 10.76-11.02 | 10.25-10.45 | 9.67-9.82 | 9.28-9.40 |

Table 24: Recomputed $\epsilon$ at $\delta = 10^{-5}$ as a function of $S$ for the datasets CIFAR-10, CIFAR-100 and SVHN and original $\epsilon \in \{1, 2, 4, 8\}$ that was computed originally at $\delta = 1/|\mathcal{D}|$. The computation is done using the PRV accountant (Gopi et al., 2021) provided in opacus (Yousefpour et al., 2021). The ranges of $\epsilon$ result from the fact that there is not a direct mapping from the original $\epsilon$ to the recomputed $\epsilon$ but the recomputed $\epsilon$ depends on the used privacy parameters (noise multiplier, subsampling ratio and number of steps).

| Dataset | Original $\epsilon$ | $S=1$ | $S=5$ | $S=10$ | $S=25$ | $S=50$ | $S=100$ | $S=250$ | $S=500$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | 1 | 3.05-3.07 | 2.04-2.13 | 1.79-1.97 | 1.56-1.57 | 1.43-1.44 | 1.32-1.34 | 1.20-1.22 | 1.13-1.14 |
| CIFAR-10 | 2 | 5.02-5.04 | 3.66-4.04 | 3.30-3.33 | 2.94-2.96 | 2.73-2.74 | 2.55-2.62 | 2.35-2.38 | 2.23-2.24 |
| CIFAR-10 | 4 | 8.52-8.54 | 6.64-7.45 | 6.11-6.24 | 5.56-5.64 | 5.21-5.28 | 4.93-5.11 | 4.60-4.68 | 4.39-4.41 |
| CIFAR-10 | 8 | 14.80-14.81 | 12.17-12.74 | 11.39-12.17 | 10.57-10.79 | 10.00-10.48 | 9.54-9.83 | 9.00-9.08 | 8.63-8.68 |
| CIFAR-100 | 1 | 1.79-1.81 | 1.43-1.48 | 1.32-1.33 | 1.20-1.22 | 1.13-1.14 | 1.07-1.08 | 1.00-1.01 | 0.96-0.96 |
| CIFAR-100 | 2 | 3.30-3.35 | 2.73-2.75 | 2.55-2.56 | 2.35-2.36 | 2.23-2.24 | 2.12-2.13 | 2.00-2.00 | 1.92-1.92 |
| CIFAR-100 | 4 | 6.12-6.27 | 5.22-5.28 | 4.93-4.98 | 4.60-4.62 | 4.39-4.46 | 4.20-4.20 | 3.98-3.99 | 3.83-3.84 |
| CIFAR-100 | 8 | 11.43-11.74 | 10.02-10.08 | 9.54-9.65 | 9.00-9.06 | 8.63-8.66 | 8.31-8.32 | 7.92-7.93 | 7.61-7.68 |
| SVHN | 1 | 3.05-3.07 | 2.03-2.13 | 1.79-1.81 | 1.56-1.58 | 1.43-1.44 | 1.32-1.35 | 1.20-1.21 | 1.13-1.14 |
| SVHN | 2 | 5.02-5.04 | 3.66-3.71 | 3.30-3.34 | 2.94-2.97 | 2.72-2.74 | 2.55-2.62 | 2.35-2.36 | 2.23-2.24 |
| SVHN | 4 | 8.52-8.54 | 6.64-6.85 | 6.12-6.71 | 5.56-5.83 | 5.22-5.37 | 4.93-5.11 | 4.60-4.62 | 4.38-4.40 |
| SVHN | 8 | 14.80-14.81 | 12.18-12.65 | 11.39-11.73 | 10.57-10.86 | 10.02-10.16 | 9.54-9.64 | 9.00-9.05 | 8.64-8.66 |


Preliminaries We denote input images $\boldsymbol{x}$ and image labels $y \in \{1, \ldots, C\}$ where $C$ is the number of image classes indexed by $c$. Assume that we have access to a model $f(\boldsymbol{x}) = h_{\phi}(b_{\boldsymbol{\theta}}(\boldsymbol{x}))$ that outputs class-probabilities for an image $p(y \mid \boldsymbol{x}, \boldsymbol{\theta}, \phi)$

$\{(\boldsymbol{x}_n, y_n)\}_n$

Training Protocol For all centralized experiments, we first draw $\mathcal{D}$ of the required size ($|\mathcal{D}| = CS$ (i.e. the number of classes $C$ multiplied by shot $S$) for varying shot or $|\mathcal{D}|$

For DP fine-tuning on 𝒟 , we use Opacus (Yousefpour et al., 2021) and compute the required noise multiplier depending on the targeted ( 𝜖 , 𝛿 ) . We report the results over three runs. For all experiments, we set 𝛿

100, the model achieves better than 90% accuracy on CIFAR-10 using only 2% of the full training split at $\epsilon = 1$. On the other hand, if TD is high, learning is more challenging and more shots are required to approach non-private accuracy. For example, for $S = 100$ and $\epsilon = 2$, SVHN achieves just over 20% accuracy and falls well short of non-private levels even at $S$

Figure 1: Classification accuracy as a function of shots and $\epsilon$ for CIFAR-10, CIFAR-100 and SVHN. Backbone is VIT-B and the best performing configuration out of All, FiLM and Head is used for each combination of $\epsilon$ and $S$, with $\delta$

Fig. 2 shows the multiplier on the number of DP shots to match non-private accuracy (see Section A.2.2 for additional figures and details). On the left we average over $S \in \{5, 10\}$, datasets, and network architectures. For all configurations, at $\epsilon = 8$, $S$ must be increased by approximately 4-8× to meet non-private accuracy and 20-35× at $\epsilon = 1$. In effect, as the privacy level increases, the required multiplier increases in an exponential manner. The multipliers are lower for simpler forms of adaptation (e.g. Head requires $20 \times S$ at $\epsilon = 1$) than for more complex forms (e.g. All requires $35 \times S$ at $\epsilon = 1$). On the right we average over $S \in \{5, 10\}$, network architectures, and learnable parameters. Even though high TD datasets require more data for good accuracy, the multiplier values are similar and independent of the TD of the dataset (around 30× at $\epsilon = 1$ and 6× at $\epsilon = 8$). Fig. 3 shows the classification accuracy as a function of TD at $S$

$1/|\mathcal{D}|$. Analysis: $S$ must be increased by approximately 20-35× to meet non-private accuracy at $\epsilon = 1$ and 4-8× at $\epsilon = 8$

Figure 3: Classification accuracy as a function of transfer difficulty (TD) and $\epsilon$ for CIFAR-10 (TD = 1.0), EuroSAT (TD = 1.7), CIFAR-100 (TD = 7.8) and SVHN (TD = 52.9) at $S = 100$. EuroSAT has been chosen because the result for $S = 100$ can be easily taken from the VTAB results (Tables 8, 9, 10, 11, 12 and 13) due to $C = 10$. Backbone is VIT-B and the best performing configuration out of All, FiLM and Head is used for each $\epsilon$, with $\delta$

Figure 4: Snapshots at $\epsilon \in \{1, 8, \infty\}$, $\delta$

Figure 5: Classification accuracy as a function of shots and learnable parameters on VIT-B for CIFAR-10, CIFAR-100 and SVHN for $\epsilon \in \{2, \infty\}$ with $\delta$

Figure 6: Average classification accuracy over all VTAB-1k datasets as a function of backbone, learnable parameters, and privacy level ($\epsilon$) at $\delta$

Fig. 7 shows the difference between the accuracy of FiLM and Head for VTAB-1k datasets as a function of 𝜖 . The datasets are ordered from low to high TD (see Table 1). At 𝜖

Figure 7: Heat map showing the accuracy difference between FiLM and Head for the VTAB-1k datasets as a function of $\epsilon$. Backbone is VIT-B. Darker red indicates FiLM is better. Darker blue indicates Head is better. Datasets ordered from low to high TD. $\delta = 10^{-3}$. Analysis: At $\epsilon$

Non-private ($\epsilon = \infty$) models are extremely vulnerable to MIAs (see Fig. 8, middle). For example, in the case of $\epsilon = \infty$, $S$

Vulnerability of non-private ( 𝜖

When $S$ is fixed, vulnerability to MIAs greatly decreases with decreasing $\epsilon$ (see Fig. 8, right). Already with $\epsilon = 2$, when $S$

Figure 8: ROC curves for LiRA (Carlini et al., 2022) on CIFAR-100 with R-50 backbone for two values of $\epsilon$ (2 and $\infty$) where $S$ varies and for $S = 50$ where $\epsilon$ varies. TPR values in legends are measured at FPR = 0.001. Complete results in Tables 14, 22 and 23. $\delta = 1/(100S)$. Analysis: Middle: Non-private models are extremely vulnerable to MIAs. For $\epsilon = \infty$, $S = 10$, Head configuration, 82.2% of the examples can be successfully identified with FPR = 0.1%. Also, vulnerability decreases as $S$ increases. Right: Increasing the privacy level reduces the vulnerability of the model as expected, when $S = 10$ with $\epsilon$

Figure 9: Left: Private (colored) and non-private (gray) FL performance on FLAIR as a function of backbone and learnable parameters. $\epsilon = 2$, $\delta$

41131 is the number of training clients, and 𝜖

How much additional data is required under few-shot DP? Image classification accuracy decreases as $\epsilon$ and $S$ decrease, and as TD increases. As a result, one should expect to use roughly 4-8× larger $S$ for $\epsilon = 8$ and 20-35× larger $S$ for $\epsilon = 1$ under DP to achieve accuracy comparable to non-private. (Note that $\delta$

Non-private Few-Shot Models Are Particularly Vulnerable to MIAs The vulnerability of non-private few-shot models increases as 𝑆 decreases. DP significantly mitigates the effectiveness of MIAs, e.g., we found that DP few-shot models can expose 2.5 % of the examples with a 1 % FPR when 𝜖

Tables 2, 3, 4, 5, 6 and 7 depict tabular results for different backbones (R-50, VIT-B), different learnable parameter sets (Head, FiLM, All), different numbers of shots per class ($S = 1, 5, 10, 25, 50, 100, 250, 500$) and various privacy levels ($\epsilon = 1, 2, 4, 8, \infty$), all at $\delta$

We compute the multiplier for a configuration and dataset at $\epsilon$ as follows: using the median accuracy obtained through the experiments depicted in Tables 2, 3, 4, 5, 6 and 7 ($S = 1, 5, 10, 25, 50, 100, 250, 500$) we linearly interpolate the median accuracy in the complete $S = [1, 500]$ grid. We determine the minimum $S$ required to reach at least the same accuracy as for non-private at $S \in \{5, 10\}$ using the $S$
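The interpolation step described above can be sketched as follows. This is a minimal illustration, assuming the multiplier is the smallest DP shot count on the integer grid that reaches the non-private accuracy, divided by the non-private shot count. The function names are hypothetical and the accuracy values are toy numbers, not results from the tables.

```python
def interpolate(shots, accs, s):
    """Linearly interpolate accuracy at shot count s from measured points."""
    points = list(zip(shots, accs))
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        if s0 <= s <= s1:
            return a0 + (a1 - a0) * (s - s0) / (s1 - s0)
    raise ValueError("s lies outside the measured shot range")

def shot_multiplier(shots, dp_accs, nonprivate_acc, reference_shots):
    """Return (minimum DP shots, multiplier) needed to match `nonprivate_acc`,
    or None if the DP accuracy never reaches it on the grid."""
    for s in range(shots[0], shots[-1] + 1):
        if interpolate(shots, dp_accs, s) >= nonprivate_acc:
            return s, s / reference_shots
    return None

# Toy example: measured shot grid from the text, made-up median DP accuracies.
shots = [1, 5, 10, 25, 50, 100, 250, 500]
dp_accs = [10.0, 20.0, 30.0, 50.0, 70.0, 80.0, 90.0, 95.0]
# Suppose the non-private model reaches 60% accuracy at S = 5 (hypothetical).
result = shot_multiplier(shots, dp_accs, nonprivate_acc=60.0, reference_shots=5)
```

In this toy example the DP model first reaches 60% at $S = 38$, which corresponds to a multiplier of 7.6 relative to the non-private $S = 5$.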

Figure 10: Multiplier of shots required to reach same accuracy as non-private with $S = 5$ for VIT-B and R-50 on CIFAR-10, CIFAR-100 and SVHN with $\delta = 1/|\mathcal{D}|$. The data is obtained using linear interpolation of the median results of the experiments of Section A.2.1. The multiplier is 1 for all $\epsilon$ for ViT-B with All parameters on SVHN at $S = 5$ (top left plot) because non-DP achieves a random accuracy and achieving random accuracy requires $S = 1$ for all configurations in the experiment.

Figure 11: Multiplier of shots required to reach same accuracy as non-private with $S = 10$ for VIT-B and R-50 on CIFAR-10, CIFAR-100 and SVHN with $\delta = 1/|\mathcal{D}|$. The data is obtained using linear interpolation of the median results of the experiments of Section A.2.1.

Figure 12: Multiplier of shots required to reach same accuracy as non-private with $S \in \{5, 10\}$ for VIT-B on CIFAR-10, CIFAR-100 and SVHN with $\delta = 1/|\mathcal{D}|$. The data is obtained using linear interpolation of the median results of the experiments of Section A.2.1. The multiplier is 1 for all $\epsilon$ for ViT-B with All parameters on SVHN at $S = 5$ (top left plot) because non-DP achieves a random accuracy and achieving random accuracy requires $S = 1$ for all configurations in the experiment.

Figure 13: Multiplier of shots required to reach same accuracy as non-private with $S \in \{5, 10\}$ for R-50 on CIFAR-10, CIFAR-100 and SVHN with $\delta$

1 / | 𝒟 | . The accuracy is reported for the median runs of Table 7. Figure 15:Test and train classification accuracy as a function of shots and learnable parameters (All, FiLM and Head) on VIT-B for CIFAR-100 for different 𝜖 with 𝛿

Tables 8, 9, 10, 11, 12 and 13 depict tabular results for different backbones (R-50, ViT-B), different learnable parameter sets (Head, FiLM, All), and various privacy levels ($\epsilon = 1, 2, 4, 8, \infty$), all at $\delta$
