Spaces:
Sleeping
Sleeping
| import streamlit as st | |
| def about(): | |
| _, centercol, _ = st.columns([1, 3, 1]) | |
| with centercol: | |
| st.markdown( | |
| """ | |
| ## Testing Semantic Importance via Betting | |
| We briefly present here the main ideas and contributions. | |
| """ | |
| ) | |
| st.markdown("""### 1. Setup""") | |
| st.image( | |
| "./assets/about/setup.jpg", | |
| caption="Figure 1: Pictorial representation of the setup.", | |
| use_column_width=True, | |
| ) | |
| st.markdown( | |
| """ | |
| We consider classification problems with: | |
| * **Input image** $X \in \mathcal{X}$. | |
| * **Feature encoder** $f:~\mathcal{X} \\to \mathbb{R}^d$ that maps input | |
| images to dense embeddings $H = f(X) \in \mathbb{R}^d$. | |
| * **Classifier** $g:~\mathbb{R}^d \\to [0,1]^k$ that separates embeddings | |
| into one of $k$ classes. We do not assume $g$ has a particular form and it | |
| can be any fixed, potentially nonlinear function. | |
| * **Concept bank** $c = [c_1, \dots, c_m] \in \mathbb{R}^{d \\times m}$ such | |
| that $c_j \in \mathbb{R}^d$ is the representation of the $j^{\\text{th}}$ concept. | |
| We assume thet $c$ is user-defined and that $m$ is small ($m \\approx 20$). | |
| * **Semantics** $Z = [Z_1, \dots, Z_m] = c^{\\top} H$ where $Z_j \in [-1, 1]$ represents the | |
| amount of concept $j$ present in the dense embedding of input image $X$. | |
| For example: | |
| * $f$ is the image encoder of a vision-language model (e.g., CLIP$^1$, OpenCLIP$^2$). | |
| * $g$ is the zero-shot classifier obtained by encoding *``A photo of a <CLASS_NAME>''* with the | |
| text encoder of the same vision-language model. | |
| * $c$ is obtained similarly by encoding the user-defined concepts. | |
| """ | |
| ) | |
| st.markdown( | |
| """ | |
| ### 2. Defining Semantic Importance | |
| Our goal is to test the statistical importance of the concepts in $c$ for the | |
| predictions of the given classifier on a particular image $x$ (capital letters denote random | |
| variables, and lowercase letters their realizations). | |
| We do not train a surrogate, interpretable model and instead consider the original, potentially | |
| nonlinear classifier $g$. This is because we want to study the semantic importance of | |
| the model that would be deployed in real-world settings and not a surrogate one that | |
| might decrease performance. | |
| We define importance from the perspective of conditional independence testing because | |
| it allows for rigorous statistical testing with false positive rate control | |
| (i.e., Type I error control). That is, the probability of falsely deeming a concept | |
| important is below a user-defined level $\\alpha \in (0,1)$. | |
| For an image $x$, a concept $j$, and a subset $S \subseteq [m] \setminus \{j\}$ (i.e., any | |
| subset that does not contain $j$), we define the null hypothesis: | |
| $$ | |
| H_0:~\hat{Y}_{S \cup \{j\}} \overset{d}{=} \hat{Y}_S, | |
| $$ | |
| where $\overset{d}{=}$ denotes equality in distribution, and $\\forall C \subseteq [m]$, | |
| $\hat{Y}_C = g(\widetilde{H}_C)$, $\widetilde{H}_C \sim P_{H \mid Z_C = z_C}$ is the conditional distribution of the dense | |
| embeddings given the observed concepts in $z_C$, i.e. the semantics of $x$. | |
| Then, rejecting $H_0$ means the concept $j$ affects the distribution of the response of | |
| the model, and it is important. | |
| """ | |
| ) | |
| st.markdown( | |
| """ | |
| ### 3. Sampling Conditional Embeddings | |
| """ | |
| ) | |
| st.image( | |
| "./assets/about/local_dist.jpg", | |
| caption=( | |
| "Figure 2: Example test (i.e., with concept) and null (i.e., without" | |
| " concept) distributions for a class-specific concept and a non-class" | |
| " specific one on three images from the Imagenette dataset as a" | |
| " function of the size of S." | |
| ), | |
| use_column_width=True, | |
| ) | |
| st.markdown( | |
| """ | |
| In order to test for $H_0$ defined above, we need to sample from the conditional distribution | |
| of the dense embeddings given certain concepts. This can be seen as solving a linear inverse | |
| problem stochastically since $Z = c^{\\top} H$. In this work, given that $m$ is small, we use | |
| nonparametric kernel density estimation (KDE) methods to approximate the target distribution. | |
| Intuitively, given a dataset $\{(h^{(i)}, z^{(i)})\}_{i=1}^n$ of dense embeddings with | |
| their semantics, we: | |
| 1. Use a weighted KDE to sample $\widetilde{Z} \sim P_{Z \mid Z_C = z_C}$, and then | |
| 2. Retrieve the embedding $H^{(i')}$ whose concept representation $Z^{(i')}$ is the | |
| nearest neighbor of $\widetilde{Z}$ in the dataset. | |
| Details on the weighted KDE and the sampling procedure are included in the paper. Figure 2 | |
| shows some example test (i.e., $\hat{Y}_{S \cup \{j\}}$) and | |
| null (i.e., $\hat{Y}_{S}$) distributions for a class-specific concept and a non-class | |
| specific one on three images from the Imagenette$^3$ dataset. We can see that the test | |
| distributions of class-specific concepts are skewed to the right, i.e. including the observed | |
| class-specific concept increases the output of the predictor. Furthermore, we see the shift | |
| decreases the more concepts are included in $S$, i.e. if $S$ is larger and it contains more | |
| information, then the marginal contribution of adding one concept will be smaller. | |
| On the other hand, including a non-class-specific concept does not change the distribution | |
| of the response of the model, no matter the size of $S$. | |
| """ | |
| ) | |
| st.markdown( | |
| """ | |
| ### 4. Testing by Betting | |
| Instead of classical hypothesis testing techniques based on $p$-values, we propose to | |
| test for the importance of concepts by *betting*.$^4$ This choice is motivated by two important | |
| properties of sequential tests: | |
| 1. They are **adaptive** to the hardness of the problem. That is, the easier it is to reject | |
| a null hypothesis, the earlier the test will stop. This induce a natural ranking of importance | |
| across concepts: if concept $j$ rejects faster than $j'$, then $j$ is more important than $j'$. | |
| 2. They are **efficient** because they only use as much data as needed to reject, instead of | |
| the entire data available as traditional, offline tests. | |
| Sequential tests instantiate a game between a *bettor* and *nature*. At every turn of the game, | |
| the bettor places a wager against the null hypothesis, and the nature reveals the truth. If | |
| the bettor wins, they will accumulate wealth, or loose some otherwise. More formally, the | |
| *wealth process* $\{K_t\}_{t \in \mathbb{N}_0}$ is defined as | |
| $$ | |
| K_0 = 1, \\quad K_{t+1} = K_t \cdot (1 + v_t\kappa_t), | |
| $$ | |
| where $v_t \in [-1,1]$ is a betting fraction, and $\kappa_t \in [-1,1]$ is the payoff of the bet. | |
| Under certain conditions, the wealth process describes a *fair game*, and for $\\alpha \in (0,1)$, | |
| it holds that | |
| $$ | |
| \mathbb{P}_{H_0}[\exists t:~K_t \geq 1/\\alpha] \leq \\alpha. | |
| $$ | |
| That is, the wealth process can be used to reject the null hypothesis $H_0$ with | |
| Type I error control at level $\\alpha$. | |
| Briefly, we use ideas of sequential kernelized independence testing (SKIT)$^5$ and define | |
| the payoff as | |
| $$ | |
| \kappa_t \coloneqq \\tanh\left(\\rho_t(\hat{Y}_{S \cup \{j\}}) - \\rho_t(\hat{Y}_S)\\right) | |
| $$ | |
| and | |
| $$ | |
| \\rho_t = \widehat{\\text{MMD}}(\hat{Y}_{S \cup \{j\}}, \hat{Y}_S) | |
| $$ | |
| is the plug-in estimator of the maximum mean discrepancy (MMD)$^6$ between the test and | |
| null distributions at time $t$. Furthermore, we use the online Newtown step (ONS)$^7$ method | |
| to choose the betting fraction $v_t$ and ensure exponential growth of the wealth. | |
| """ | |
| ) | |
| st.markdown( | |
| """ | |
| --- | |
| **References** | |
| [1] CLIP is available at https://github.com/openai/CLIP . | |
| [2] OpenCLIP is available at https://github.com/mlfoundations/open_clip . | |
| [3] The Imagenette dataset is available at https://github.com/fastai/imagenette . | |
| [4] Glenn Shafer. Testing by betting: A strategy for statistical and scientific communication. | |
| Journal of the Royal Statistical Society Series A: Statistics in Society, 184(2):407-431, 2021. | |
| [5] Aleksandr Podkopaev et al. Sequential kernelized independence testing. In International | |
| Conference on Machine Learning, pages 27957-27993. PMLR, 2023. | |
| [6] Arthur Gretton et al. A kernel two-sample test. The Journal of Machine Learning Research, | |
| 13(1):723-773, 2012. | |
| [7] Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online | |
| learning in banach spaces. In Conference On Learning Theory, pages 1493-1529. PMLR, 2018. | |
| """ | |
| ) | |