Title: PAC Prediction Sets for Large Language Models of Code
URL Source: https://arxiv.org/html/2302.08703
Abstract
Prediction sets have recently been shown to be a promising strategy for quantifying the uncertainty of deep neural networks in a way that provides theoretical guarantees. However, existing techniques have largely targeted settings where the space of labels is simple, so prediction sets can be arbitrary subsets of labels. For structured prediction problems where the space of labels is exponential in size, even prediction sets containing a small fraction of all labels can be exponentially large. In the context of code generation, we propose a solution that considers a restricted set of prediction sets that can compactly be represented as partial programs, which are programs with portions replaced with holes. Given a trained code generation model, our algorithm leverages a programming language's abstract syntax tree to generate a set of programs such that the correct program is in the set with high confidence. Valuable applications of our algorithm include a Codex-style code generator with holes in uncertain parts of the generated code, which provides a partial program with theoretical guarantees. We evaluate our approach on PICARD (a T5 model for SQL semantic parsing) and Codex (a GPT model for over a dozen programming languages, including Python), demonstrating that our approach generates compact PAC prediction sets. This is the first research contribution that generates PAC prediction sets for generative code models. (Our code is available at https://github.com/adamkhakhar/python-pac-code-prediction-set.)
Machine Learning, ICML
1 Introduction
There has been a great deal of recent interest in uncertainty quantification for deep learning models (Abdar et al., 2021). These techniques are critical for applications where machine learning models are used to guide human decision-makers (e.g., medical or financial decisions), since they convey confidence in the model predictions that can be used to aid downstream decisions.
One promising strategy is conformal prediction (Wilks, 1941; Vovk et al., 2005), a statistical technique that has recently been adapted to machine learning (Tibshirani et al., 2019; Park et al., 2020; Bates et al., 2021). These techniques focus on constructing prediction sets, which capture uncertainty by predicting sets of labels rather than individual labels. A benefit of these approaches is that they come with provable guarantees: typically, that the prediction set includes the ground truth label with high probability.
Most of the work so far has focused on settings where the input $x$ may be complex, but the label $y$ has a relatively simple structure, as in classification and regression. (Regression problems have an infinite label space, but existing approaches restrict to prediction sets in the form of intervals, which automatically satisfy the monotonicity property described below.) While there is no theoretical obstacle to considering more structured labels, the prediction sets can easily become uninterpretable when the label space is complex, as the uncertainty in the model is not easily identifiable. We consider the case of large language models for code generation, where the output is a sequence of tokens. For such models, the output space is exponentially large in the length of the generated sequence, meaning that even if a prediction set contains only a small fraction of labels, it may still be exponentially large.
While recent work has studied structured prediction problems such as object detection and image segmentation (Schulz & Behnke), it avoids this problem by breaking up the prediction space into individual components for which compact prediction sets can be constructed. For instance, for object detection, while the output is a list of bounding boxes, we can construct a prediction set of bounding boxes that may occur, and then separately construct a prediction set for each bounding box.
A natural strategy for constructing prediction sets for structured outputs is to restrict them, rather than allowing arbitrary subsets of labels, to sets that can be represented compactly. However, existing prediction set algorithms are not designed to search over these structured spaces, meaning that new algorithms are necessary for inferring structured prediction sets.
In this paper, we study prediction sets for code generation, where the labels are sequences of tokens corresponding to valid lines of code. Recent work has demonstrated that large language models based on the GPT architecture (Brown et al., 2020) are effective strategies for code generation from context (Chen et al., 2021). For this domain, we propose to represent prediction sets using partial programs (Solar-Lezama, 2008), which are programs with some portions replaced with holes. A partial program implicitly represents the prediction set of all programs that can be obtained by filling its holes to produce a valid program. Partial programs are both a natural way of presenting sets of programs to the user and provide needed structure on the space of sets of programs. Prior work has constructed partial programs using large language models (Guo et al., 2021), but they do not interpret them as prediction sets, and furthermore do not provide any theoretical guarantees.
To construct PAC prediction sets in this setting, we propose a novel algorithm that modifies an existing one (Park et al., 2020) to account for the restricted search space. This algorithm operates by establishing a 1D search space over prediction sets, given a scoring function $f(x,y) \in \mathbb{R}$ over features $x \in \mathcal{X}$ and labels $y \in \mathcal{Y}$ (typically the probabilities predicted by a traditional neural network).
The main challenge in adapting this strategy to prediction sets represented by partial programs is that the search space of partial programs is not 1D. To address this problem, we artificially construct a 1D search space by pre-selecting a set of partial programs to consider. As long as this set can be represented in the form of equation (1) for some scoring function, we can use an existing prediction set algorithm to select among the prediction sets in this set. The key condition the set must satisfy is monotonicity, i.e., each partial program must be either a superset or subset of each of the others. Then, to compute this set, we devise an integer linear program that encodes the monotonicity constraint along with other constraints on the structure of the partial programs.
We empirically evaluate our approach on both PICARD (Scholak et al., 2021), a state-of-the-art semantic parser based on T5 (Raffel et al., 2019), trained on the Spider dataset (Yu et al., 2018) for SQL semantic parsing, as well as Codex (Chen et al., 2021), a GPT language model fine-tuned on publicly available GitHub code with proficiency in over a dozen programming languages. Our experiments demonstrate that our approach can generate prediction sets of the desired form that satisfy the PAC guarantee, while significantly outperforming a natural baseline in terms of a measure of prediction set size.
Example. In Figure 1(a), we show an example of an SQL query from the Spider dataset along with the abstract syntax tree exposing its syntactic structure. In Figure 1(b), we show the SQL query (and corresponding AST) as predicted by PICARD. Below the predicted query, we show the partial program obtained by our algorithm (it is obtained by deleting the nodes in the AST marked by red crosses). This partial program represents the prediction set of all programs that can be obtained by filling the ?? portions with some expressions. It is guaranteed to contain the ground truth query with high probability.
Contributions. Our contributions include the notion of partial programs as prediction sets for code generation, an algorithm for constructing PAC prediction sets in this setting, and an empirical evaluation demonstrating the efficacy of our algorithm. Finally, while we focus on code generation, we believe our techniques can be adapted to construct prediction sets for other structured prediction problems.
SELECT COUNT(*) FROM countries AS t1 JOIN car_makers AS t2 on t1.countryid = t2.country JOIN model_list as t3 on t2.id=t3.maker WHERE t1.countryname = "usa";
(a) Ground truth abstract syntax tree (AST) and SQL query from the Spider dataset (Yu et al., 2018).
SELECT COUNT(*) FROM countries AS t1
JOIN car_makers as t2 on t1.countryid = t2.country
WHERE t1.countryname = "usa";
SELECT COUNT(*) FROM countries AS t1
JOIN ?? on ?? WHERE t1.countryname = "usa";
(b) Predicted AST, predicted SQL query, and prediction set for the same task as in Figure 1(a), with $m=2$ holes.
Figure 1: Example of a ground truth SQL query, predicted SQL query, and the constructed prediction set. Note that the predicted query is incorrect. With the two holes in the SQL query, the prediction set contains the ground truth query.
2 Background
In this section, we introduce the notion of PAC prediction sets, as well as the code generation task.
2.1 PAC Prediction Sets
PAC prediction sets based on conformal prediction have recently been proposed as a rigorous way to quantify uncertainty for deep neural networks (Park et al., 2020). A prediction set model is a function $F : \mathcal{X} \to 2^{\mathcal{Y}}$ mapping inputs $x$ to sets of labels $F(x) \subseteq \mathcal{Y}$. We typically construct prediction sets based on a given scoring function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, which maps a feature $x \in \mathcal{X}$ and a label $y \in \mathcal{Y}$ to a score (with higher scores corresponding to higher likelihood of being the true label). Scoring functions are often neural networks. Given $f$, we consider the family of prediction set models with a single parameter $\tau \in \mathbb{R}$:

$$F_\tau(x) \coloneqq \{ y \in \mathcal{Y} \mid f(x,y) \geq \tau \}, \tag{1}$$

i.e., the set of all labels with score at least $\tau$. To define correctness, we assume that the data is sampled i.i.d. according to some distribution $p(x,y)$.
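For concreteness, here is a minimal sketch of this thresholded construction on a toy label space; the labels and scores below are hypothetical stand-ins for a model's predicted probabilities:

```python
# Construct the prediction set F_tau(x) = {y : f(x, y) >= tau} for a
# small, explicitly enumerable label space. The scores are hypothetical
# model probabilities, not from any real model.
def prediction_set(scores, tau):
    """scores: dict mapping label -> f(x, y); returns the level set at tau."""
    return {y for y, s in scores.items() if s >= tau}

scores = {"a": 0.70, "b": 0.20, "c": 0.08, "d": 0.02}
print(prediction_set(scores, 0.5))   # high threshold: only the top label
print(prediction_set(scores, 0.05))  # lower threshold: a larger set
```

Lowering $\tau$ only ever adds labels, which is the monotone structure the PAC calibration below relies on.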
Definition 2.1.
A parameter $\tau \in \mathbb{R}_{\geq 0}$ is $\epsilon$-correct if

$$\mathbb{P}_{p(x,y)}\left[ y \in F_\tau(x) \right] \geq 1 - \epsilon.$$
We let $\mathcal{T}_\epsilon \subseteq \mathbb{R}$ denote the set of $\epsilon$-correct $\tau$. Next, consider an estimator $\hat{\tau} : \mathcal{Z}^* \to \mathbb{R}$, where $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ (with $\mathcal{Z}^* = \bigcup_{i=1}^{\infty} \mathcal{Z}^i$) is the set of calibration datasets. We let $p(Z) = \prod_{(x,y) \in Z} p(x,y)$ be the distribution over datasets.
Definition 2.2.
An estimator $\hat{\tau}$ is $(\epsilon, \delta)$-probably approximately correct (PAC) if

$$\mathbb{P}_{p(Z)}\left[ \hat{\tau}(Z) \in \mathcal{T}_\epsilon \right] \geq 1 - \delta.$$
Our goal is to construct PAC prediction sets that are small in size, where size is defined as
SΒ―β’(Ο)=πΌ pβ’(x,y)β’[Sβ’(F Οβ’(x))],Β―π π subscript πΌ π π₯ π¦ delimited-[]π subscript πΉ π π₯\displaystyle\bar{S}(\tau)=\mathbb{E}{p(x,y)}[S(F{\tau}(x))],overΒ― start_ARG italic_S end_ARG ( italic_Ο ) = blackboard_E start_POSTSUBSCRIPT italic_p ( italic_x , italic_y ) end_POSTSUBSCRIPT [ italic_S ( italic_F start_POSTSUBSCRIPT italic_Ο end_POSTSUBSCRIPT ( italic_x ) ) ] ,
for some given size metric S:2 π΄βββ₯0:πβsuperscript 2 π΄ subscript β absent 0 S:2^{\mathcal{Y}}\to\mathbb{R}{\geq 0}italic_S : 2 start_POSTSUPERSCRIPT caligraphic_Y end_POSTSUPERSCRIPT β blackboard_R start_POSTSUBSCRIPT β₯ 0 end_POSTSUBSCRIPT. Assuming S π S italic_S is monotone (i.e., YβYβ²π superscript πβ²Y\subseteq Y^{\prime}italic_Y β italic_Y start_POSTSUPERSCRIPT β² end_POSTSUPERSCRIPT implies Sβ’(Y)β€Sβ’(Yβ²)π π π superscript πβ²S(Y)\leq S(Y^{\prime})italic_S ( italic_Y ) β€ italic_S ( italic_Y start_POSTSUPERSCRIPT β² end_POSTSUPERSCRIPT )), then larger values of Ο π\tau italic_Ο correspond to smaller sizes. Thus, our goal is to maximize Ο π\tau italic_Ο while guaranteeing that Οβπ― Ο΅ π subscript π― italic-Ο΅\tau\in\mathcal{T}{\epsilon}italic_Ο β caligraphic_T start_POSTSUBSCRIPT italic_Ο΅ end_POSTSUBSCRIPT.
For this setting, there exists an estimator $\hat{\tau}$ that constructs a PAC prediction set. These algorithms optimize $\tau$ subject to a constraint on the number of errors on the validation set:

$$\hat{\tau}(Z) = \operatorname*{arg\,max}_{\tau \in \mathbb{R}} \tau \quad \text{subj. to} \quad \sum_{i=1}^{n} \mathbb{1}(y_i \notin F_\tau(x_i)) \leq k, \tag{2}$$

where $k$ can be computed from $\epsilon$, $\delta$, and $n$ (see Park et al. (2020) for details). In other words, we choose the largest $\tau$ subject to a bound $k$ on the number of errors made by the prediction set.
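The estimator in equation (2) reduces to a simple computation over the calibration scores $f(x_i, y_i)$. The sketch below assumes the standard choice of $k$ from Park et al. (2020), namely the largest $k$ for which the binomial CDF at $k$ (with $n$ trials and success probability $\epsilon$) is at most $\delta$; the calibration scores used in the example are hypothetical:

```python
from math import comb

def max_errors(n, epsilon, delta):
    """Largest k with P[Binomial(n, epsilon) <= k] <= delta; -1 if none.
    (A sketch of one standard PAC choice of k; treat as illustrative.)"""
    cdf, k = 0.0, -1
    for i in range(n + 1):
        cdf += comb(n, i) * epsilon**i * (1 - epsilon)**(n - i)
        if cdf <= delta:
            k = i
        else:
            break
    return k

def estimate_tau(scores, epsilon, delta):
    """scores: list of calibration scores f(x_i, y_i).
    Returns the largest tau making at most k calibration errors, where
    y_i is excluded from F_tau(x_i) exactly when f(x_i, y_i) < tau."""
    k = max_errors(len(scores), epsilon, delta)
    if k < 0:
        return float("-inf")  # no threshold certifiable: predict everything
    # Raising tau to the (k+1)-th smallest score makes exactly the k
    # smallest-scoring calibration examples into errors.
    return sorted(scores)[k]
```

For example, with $n = 10$, $\epsilon = 0.5$, and $\delta = 0.05$, this yields $k = 1$, so $\hat{\tau}$ is the second-smallest calibration score.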
2.2 Code Generation
We consider the problem of generating code in some language, where the label is a sequence of tokens, i.e., we have $\mathcal{Y} = \Sigma^*$, where $\Sigma$ is the set of tokens. While our approach is general, we focus on strategies based on large language models (LLMs), where the input $x$ consists of the generation context (e.g., several lines of code preceding the line to be generated); recent work has demonstrated that LLMs are very effective models for solving this problem (Iqbal & Qureshi, 2022). In this strategy, tokens represent parts of or whole words (Sennrich et al., 2015), which are then predicted, either autoregressively (Liu et al., 2022) or sequence-to-sequence (Sutskever et al., 2014), to form a full target when concatenated.
2.3 Abstract Syntax Trees
Our strategy for constructing prediction sets relies on the abstract syntax tree of the generated code. This data structure is similar to a parse tree obtained by using a context-free grammar (CFG) parser to parse a sentence, but aims to represent constructs in the code (e.g., variables, assignment statements, if statements, etc.) while abstracting away low-level syntax (e.g., parentheses matching).
Formally, given a (valid) sequence of tokens $y \in \Sigma^*$, an AST is a tree $T = (V, E)$, with nodes $v \in V$ and edges $(v, v') \in E \subseteq V \times V$. Each node $v \in V$ represents some subsequence of tokens in $y$ (the subsequence of a node must be contained in the subsequence of its parent). This subsequence is given by a mapping $\eta : V \to \mathbb{N}^2$, where $\eta(v) = (i, j)$ indicates that node $v$ corresponds to the subsequence $y_i \ldots y_{j-1}$ of $y$. Then, we assume that if $\eta(v) = (i, j)$, its parent $v'$ satisfies $\eta(v') = (i', j')$ for some $i' \leq i < j \leq j'$ (here, the parent $v'$ is the unique node $v' \in V$ such that there is an edge $(v', v) \in E$).
ASTs are language-specific, and can also vary from one implementation to another; we assume the AST construction is provided by the user. Standard implementations exist for all programming languages, since ASTs are required to compile or interpret programs. Figure 1(a) depicts an example of an AST for an SQL query. In our approach, we use the AST to restrict the space of prediction sets we consider. Intuitively, we consider prediction sets that can be represented by replacing some subtree of the AST with a "hole" representing an arbitrary sequence of tokens.
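To illustrate the mapping $\eta$ from nodes to spans, the sketch below uses Python's built-in `ast` module; it records character offsets within a single line rather than token indices, but the containment property (each node's span lies within its parent's) is the same:

```python
import ast

# Map each AST node to the span of source text it covers, analogous to
# the eta mapping: eta(v) = (i, j) means node v spans source[i:j].
source = "print([1, 2, 3])"
tree = ast.parse(source)

def spans(root):
    """Return (node type, covered source text) for every located node."""
    out = []
    for node in ast.walk(root):
        if hasattr(node, "col_offset") and hasattr(node, "end_col_offset"):
            out.append((type(node).__name__,
                        source[node.col_offset:node.end_col_offset]))
    return out

for name, text in spans(tree):
    print(name, "->", repr(text))
```

For instance, the `Name` node covers `print` and the `List` node covers `[1, 2, 3]`, and each constant's span is contained in the list's span.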
3 PAC Prediction Sets for Code Generation
In this section, we formalize the notion of a PAC prediction set for code generation.
3.1 Partial Programs
Our algorithm takes as input a trained model represented as a scoring function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, along with a calibration dataset $Z = \{(x_i, y_i)\}_{i=1}^{n}$ of input-output pairs. We assume an algorithm that, given an input $x$, uses $f$ to produce a label $y = \bar{f}(x)$; e.g., we may compute $y$ using a greedy decoding strategy.
Our goal is to use f π f italic_f to construct a PAC prediction set. The challenge is that there are an enormous number of possible labels (exponential in the length of the generated sequence), meaning a traditional prediction set will not be human-interpretable. Thus, we need to constrain the search space over prediction set models.
To this end, we consider prediction sets defined by partial programs, which are sequences with special tokens called holes; each hole can be filled with an arbitrary subsequence, and filling all holes produces a complete sequence. Formally, we denote the hole token by $??$; then, a partial program is a sequence $y \in (\Sigma \cup \{??\})^*$. Partial programs also require that holes occur in a way compatible with the context-free grammar defining the programming language; we describe this condition in Section 3.2. Suppose that $y$ has $k$ holes:
y=y 0β’?β’?β’y 1β’β¦β’y kβ1β’?β’?β’y k,π¦ subscript π¦ 0??subscript π¦ 1β¦subscript π¦ π 1??subscript π¦ π\displaystyle y=y_{0};??;y_{1}...;y_{k-1}??;y_{k},italic_y = italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ? ? italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT β¦ italic_y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT ? ? italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ,
where y 0,y 1,β¦,y kβΞ£subscript π¦ 0 subscript π¦ 1β¦subscript π¦ π superscript Ξ£ y_{0},y_{1},...,y_{k}\in\Sigma^{}italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , β¦ , italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT β roman_Ξ£ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT. Then, given yΒ―1,β¦,yΒ―kβΞ£subscriptΒ―π¦ 1β¦subscriptΒ―π¦ π superscript Ξ£\bar{y}{1},...,\bar{y}{k}\in\Sigma^{}overΒ― start_ARG italic_y end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , β¦ , overΒ― start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT β roman_Ξ£ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT, we can fill the holes in y π¦ y italic_y using these sequences to obtain
yΒ―=fillβ’(y;yΒ―1,β¦,yΒ―k)=y 0β’yΒ―1β’y 1β’β¦β’y kβ1β’yΒ―kβ’y k,Β―π¦ fill π¦ subscriptΒ―π¦ 1β¦subscriptΒ―π¦ π subscript π¦ 0 subscriptΒ―π¦ 1 subscript π¦ 1β¦subscript π¦ π 1 subscriptΒ―π¦ π subscript π¦ π\displaystyle\bar{y}=\text{fill}(y;\bar{y}{1},...,\bar{y}{k})=y_{0}\bar{y}{% 1}y{1}...y_{k-1}\bar{y}{k}y{k},overΒ― start_ARG italic_y end_ARG = fill ( italic_y ; overΒ― start_ARG italic_y end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , β¦ , overΒ― start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT overΒ― start_ARG italic_y end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT β¦ italic_y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT overΒ― start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ,
i.e., replace the i π i italic_i th hole with yΒ―i subscriptΒ―π¦ π\bar{y}_{i}overΒ― start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. We call yΒ―βΣ¯π¦ superscript Ξ£\bar{y}\in\Sigma^{}overΒ― start_ARG italic_y end_ARG β roman_Ξ£ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT a completion of y π¦ y italic_y; we denote the set of completions of y π¦ y italic_y by
Οβ’(y)β{yΒ―βΞ£β£βy 1,β¦,y kβΞ£.yΒ―=fillβ’(y;y 1,β¦,y k)},βπ π¦ conditional-setΒ―π¦ superscript Ξ£ formulae-sequence subscript π¦ 1β¦subscript π¦ π superscript Σ¯π¦ fill π¦ subscript π¦ 1β¦subscript π¦ π\displaystyle\pi(y)\coloneqq{\bar{y}\in\Sigma^{}\mid\exists y_{1},...,y_{k}% \in\Sigma^{},.,\bar{y}=\text{fill}(y;y_{1},...,y_{k})},italic_Ο ( italic_y ) β { overΒ― start_ARG italic_y end_ARG β roman_Ξ£ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT β£ β italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , β¦ , italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT β roman_Ξ£ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT . overΒ― start_ARG italic_y end_ARG = fill ( italic_y ; italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , β¦ , italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) } ,
i.e., all possible ways in which the holes in y π¦ y italic_y can be filled (note that Οβ’(y)π π¦\pi(y)italic_Ο ( italic_y ) may be an infinite set).
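The fill operation can be sketched directly over token sequences; representing tokens as strings in a list, with the literal string `"??"` as the hole token, is an illustrative assumption:

```python
HOLE = "??"

def fill(partial, *fillers):
    """Replace the i-th hole token in `partial` with the token sequence
    fillers[i], mirroring fill(y; y_1, ..., y_k)."""
    assert partial.count(HOLE) == len(fillers), "one filler per hole"
    out, it = [], iter(fillers)
    for tok in partial:
        if tok == HOLE:
            out.extend(next(it))  # splice the filler sequence into place
        else:
            out.append(tok)
    return out

# print([??]) with the hole filled by the list elements:
partial = ["print", "(", "[", HOLE, "]", ")"]
complete = fill(partial, ["1", ",", "2", ",", "3"])
print("".join(complete))  # print([1,2,3])
```

The set $\pi(y)$ is then all outputs of `fill` over all choices of filler sequences, which is why it can be infinite.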
Now, our strategy is to restrict prediction sets to sets of programs represented by partial programs. At a high level, our algorithm starts with a concrete program $\bar{y} = \bar{f}(x)$, and then replaces a certain number of subsequences in $\bar{y}$ with holes to obtain a partial program $y$ such that $\pi(y)$ contains the true program $y^*$ with high probability.
3.2 Leveraging AST Structure
An additional constraint is that we want to remove subsequences that correspond to full "units of code". For example, if we start with the program print([1,2,3]), we might remove a subsequence to obtain print([??); however, this partial program can be hard for the user to understand since the brackets do not match. Instead, we may want to require the algorithm to either remove only the elements of the array to obtain print([??]), or remove the full array to obtain print(??). Thus, we only consider removing subsequences corresponding to nodes in the AST $T = (V, E)$ of $\bar{y}$, i.e., replacing a sequence $\bar{y}_i \ldots \bar{y}_{j-1}$ with a hole $??$, where $(i, j) = \eta(v)$ for some $v \in V$.
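This constraint can be sketched with Python's `ast` module: a hole may replace exactly the character span of some AST node, so every candidate partial program stays structurally well-formed (spans here are character offsets on a single line; `??` is the hole notation from above):

```python
import ast

def hole_candidates(source):
    """Enumerate the partial programs obtained by replacing the span of
    exactly one AST node of `source` with the hole token '??'.
    A sketch: assumes `source` is a single line of valid Python."""
    tree = ast.parse(source)
    out = set()
    for node in ast.walk(tree):
        if hasattr(node, "col_offset") and hasattr(node, "end_col_offset"):
            i, j = node.col_offset, node.end_col_offset
            out.add(source[:i] + "??" + source[j:])
    return out

cands = hole_candidates("print([1, 2, 3])")
for c in sorted(cands):
    print(c)
```

The candidates include print(??) (the whole list removed) and print([1, 2, ??]) (one element removed), but never an ill-formed fragment like print([?? with unmatched brackets.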
3.3 Bounded Number of Holes
Even with the AST constraint, we can still obtain very complicated partial programs by removing a large number of leaf nodes. For example, we could remove three nodes from print([1,2,3]) to obtain the partial program print([??,??,??]); for longer lists, the resulting partial program would be even more complex. A simpler solution is to leave the entire contents of the list as a single hole: print([??]). To this end, we impose a constraint that at most $m$ subtrees of the AST are replaced with holes, resulting in a partial program with at most $m$ holes. Here, $m \in \mathbb{N}$ is a given hyperparameter; for instance, we may take $m$ to be between 1 and 3. In our experiments, we find that $m = 1$ works well in practice.
3.4 Problem Formulation
Our goal is to design an algorithm $\mathcal{A}$ that maps a validation set $Z = \{(x_i, y_i)\}_{i=1}^{n}$ and a new input $x \in \mathcal{X}$ to a partial program $y = \mathcal{A}(x; Z)$ such that $\mathcal{A}$ satisfies the PAC property given in Definition 2.2, i.e., we have

$$\mathbb{P}_{p(Z)}\left[ \mathbb{P}_{p(x,y)}\left[ y \in \pi(\mathcal{A}(x; Z)) \right] \geq 1 - \epsilon \right] \geq 1 - \delta,$$

where $p(Z) = \prod_{i=1}^{n} p(x_i, y_i)$. As before, we additionally want to minimize the expected size of $\pi(\mathcal{A}(x; Z))$. As we describe in the next section, our algorithm constructs $y = \mathcal{A}(x; Z)$ by starting from the highest-probability prediction $\bar{y} = \operatorname*{arg\,max}_y f(x, y)$ (more specifically, an approximate maximum based on greedy decoding), and then removing subtrees of the AST of $\bar{y}$ to obtain $y$. Then, we define size to be the fraction of nodes retained in $y$, i.e., $S(y) = |T'| / |T|$, where $T'$ is the AST of $y$ and $T$ is the AST of $\bar{y}$.
4 Structured Prediction Set Algorithm
We describe our algorithm $\mathcal{A}$ (Algorithm 1) for constructing structured PAC prediction sets for code generation.
4.1 General Strategy
Recall that existing approaches to constructing prediction sets rely on a 1D search space over parameters $\tau \in \mathbb{R}$. Our approach is to design such a 1D search space and then apply existing prediction set algorithms. However, we cannot directly use the scoring function $f(x, y')$ to construct prediction sets, since its level sets $\{ y' \mid f(x, y') \geq \tau \}$ may not have the desired structure described in Section 3, i.e., they may not have the form $\pi(y)$ for some partial program $y$.
Instead, we design a modified scoring function $\tilde{f}(x,y')$ whose level sets have the form $\{y' \mid \tilde{f}(x,y') \ge \tau\} = \pi(y)$ for some partial program $y$. To achieve this goal, we first note that for a single input $x$, if our goal is to obtain a prediction set of the form $\pi(y)$ for some partial program $y$, we need $\tilde{f}(x,y') > \tau$ for all $y' \in \pi(y)$ and $\tilde{f}(x,y') < \tau$ otherwise.
For instance, we could define $\tilde{f}(x,y') = \tau + \gamma$ for all $y' \in \pi(y)$ (for an arbitrarily small $\gamma \in \mathbb{R}_{>0}$), and $\tilde{f}(x,y') = \tau - \gamma$ otherwise.
More generally, our strategy can be envisioned as follows:
- Design an algorithm $\tilde{\mathcal{A}}$ such that for every input $x$ and parameter $\tau$, it outputs a partial program $y = \tilde{\mathcal{A}}(x,\tau)$ that corresponds to a desired prediction set $\pi(y)$.
- Construct a modified scoring function $\tilde{f}$ such that $\{y' \mid \tilde{f}(x,y') \ge \tau\} = \pi(y)$ for some $y$.
- Use an existing PAC prediction set algorithm to choose a threshold $\hat{\tau}(Z)$ based on $\tilde{f}$.
The key constraint on $\tilde{\mathcal{A}}$ is that we must be able to construct $\tilde{f}$. We saw that we can construct $\tilde{f}$ for a single triple $(x, \tau, y)$, but given multiple triples, such an $\tilde{f}$ may not exist.
For $\tilde{f}$ to exist, these triples must satisfy *monotonicity*; i.e., for two parameters $\tau, \tau' \in \mathbb{R}$ such that $\tau \le \tau'$, the corresponding prediction sets satisfy $F_\tau(x) \supseteq F_{\tau'}(x)$. In other words, as the threshold on the score decreases, the prediction set grows uniformly for all inputs $x \in \mathcal{X}$. Thus, we need to ensure that our partial programs also satisfy this property; i.e., if $\tau \le \tau'$, then letting $y = \tilde{\mathcal{A}}(x,\tau)$ and $y' = \tilde{\mathcal{A}}(x,\tau')$, we have $\pi(y) \supseteq \pi(y')$.
We impose a stronger constraint that implies monotonicity: if $\tau \le \tau'$, then we require that every node in $y = \tilde{\mathcal{A}}(x,\tau)$ also occurs in $y' = \tilde{\mathcal{A}}(x,\tau')$. It is easy to see that if the AST of $y$ contains a subset of the nodes in the AST of $y'$, then $\pi(y) \supseteq \pi(y')$. For simplicity, we refer to this stronger constraint as monotonicity for the remainder of the paper.
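This stronger node-subset constraint is easy to check on a candidate chain of partial programs. The sketch below is our own toy encoding (each $y_i$ represented by the set of retained AST node ids), not part of the paper's method:

```python
def is_monotone(retained_node_sets):
    """retained_node_sets[i] holds the AST node ids retained in
    y_i = A~(x, tau_i), listed in order of increasing tau_i.  The stronger
    constraint requires each retained-node set to be contained in the next;
    fewer retained nodes mean more holes, hence pi(y_i) >= pi(y_{i+1})."""
    return all(a <= b for a, b in zip(retained_node_sets, retained_node_sets[1:]))

# A valid chain: retained nodes only grow as tau increases.
assert is_monotone([{0}, {0, 1}, {0, 1, 2}])
# Invalid: node 1 is retained at a lower tau but dropped at a higher one.
assert not is_monotone([{0, 1}, {0, 2}])
```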
Theorem 4.1.
Assume that $\tilde{\mathcal{A}}$ is monotone. Then there exists a scoring function $\tilde{f}$ such that for all $x \in \mathcal{X}$, we have
$$\pi(\tilde{\mathcal{A}}(x,\tau)) = \tilde{F}_\tau(x) \coloneqq \{y' \mid \tilde{f}(x,y') \ge \tau\}$$
for all $\tau \in \mathbb{R}$ except a finite subset.
Proof.
Recall that we have assumed that we choose $y$ to be a subset of the most likely program $\bar{y} = \arg\max_{y'} f(x,y')$ (i.e., we remove some subset of nodes in $\bar{y}$). Thus, the space of possible partial programs $y$ for a given input $x$ is finite. Suppose the partial programs encountered by enumerating $\tau$ from $-\infty$ to $+\infty$ are $y_1, \dots, y_k$; this chain must be finite due to monotonicity. Also, $\pi(y_1) \supseteq \pi(y_2) \supseteq \dots \supseteq \pi(y_k)$.
Next, let the value of $\tau$ at which $\tilde{\mathcal{A}}(x,\tau)$ changes from $y_{i-1}$ to $y_i$ be $\tau_i$, and define $\tau_1 = -\infty$ and $\tau_{k+1} = +\infty$. Then, we have $\tilde{\mathcal{A}}(x,\tau) = y_i$ for $\tau_i < \tau < \tau_{i+1}$ (the value at $\tau_i$ can be either $y_i$ or $y_{i-1}$). Now, we can define
$$\tilde{f}(x,y') = \tau_i \quad \text{where} \quad y' \in \pi(y_i) \wedge y' \notin \pi(y_{i+1}),$$
noting that by monotonicity, the condition holds for at most one $i$; if it holds for no $i$ but $y' \in \pi(y_k)$, then we take $\tilde{f}(x,y') = \infty$; otherwise, we take $\tilde{f}(x,y') = -\infty$. (Footnote 3: If we assume that $y_1 = \texttt{??}$ is the partial program consisting of a single hole, then $y' \in \pi(y_1)$ for all $y'$, so this last case is unnecessary.) This strategy ensures that the scores $\tilde{f}(x,y')$ fall between the $\tau_i$. It is easy to see that with this choice of $\tilde{f}$, we have
$$\pi(y_i) = \{y' \mid \tilde{f}(x,y') \ge \tau\}$$
for all $\tau \in \mathbb{R} \setminus \{\tau_1, \dots, \tau_k\}$. By construction, we have $\tilde{\mathcal{A}}(x,\tau) \in \{y_i\}_{i=1}^{k}$, so the claim follows. ∎
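To illustrate the construction in the proof, the following sketch assigns scores from a nested chain of prediction sets and checks that thresholding recovers the chain. The finite label space and plain Python sets are our own toy setup; the actual label space is the set of all programs:

```python
import math

def score_from_chain(sets, taus, y_prime):
    """Score from the proof of Theorem 4.1, given a nested chain of
    prediction sets pi(y_1) >= ... >= pi(y_k) (here: plain Python sets over a
    toy finite label space) and thresholds tau_1 < ... < tau_k:
      f~(x, y') = tau_i   if y' in pi(y_i) and y' not in pi(y_{i+1}),
      f~(x, y') = +inf    if y' in pi(y_k),
      f~(x, y') = -inf    otherwise."""
    if y_prime in sets[-1]:
        return math.inf
    for i in range(len(sets) - 1):
        if y_prime in sets[i] and y_prime not in sets[i + 1]:
            return taus[i]
    return -math.inf

space = {"a", "b", "c", "d"}
sets = [{"a", "b", "c"}, {"a", "b"}, {"a"}]  # pi(y_1) >= pi(y_2) >= pi(y_3)
taus = [1.0, 2.0, 3.0]
level = {y for y in space if score_from_chain(sets, taus, y) >= 2.0}
# level == {"a", "b"}, i.e., thresholding at tau_2 recovers pi(y_2)
```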
In our algorithm, we can avoid problematic values of $\tau$ (i.e., the finite subset excluded in the statement of Theorem 4.1) simply by reducing $\hat{\tau}(Z)$ by a tiny amount; i.e., we use $\hat{\tau}(Z) - \gamma$ for an arbitrarily small $\gamma \in \mathbb{R}_{>0}$.
Corollary 4.2.
For sufficiently small $\gamma \in \mathbb{R}_{>0}$, the prediction set $\pi(\tilde{\mathcal{A}}(x, \hat{\tau}(Z) - \gamma))$ is PAC, where $\hat{\tau}(Z)$ is obtained by running an existing PAC prediction set algorithm with $\tilde{f}$.
Proof.
Note that for sufficiently small $\gamma$, either $\hat{\tau}(Z)$ or $\hat{\tau}(Z) - \gamma$ is not problematic. If the former holds, then
$$\pi(\tilde{\mathcal{A}}(x, \hat{\tau}(Z) - \gamma)) \supseteq \pi(\tilde{\mathcal{A}}(x, \hat{\tau}(Z))) = \tilde{F}_{\hat{\tau}(Z)}(x).$$
If the latter holds, then
$$\pi(\tilde{\mathcal{A}}(x, \hat{\tau}(Z) - \gamma)) = \tilde{F}_{\hat{\tau}(Z) - \gamma}(x) \supseteq \tilde{F}_{\hat{\tau}(Z)}(x).$$
By assumption, $\tilde{F}_{\hat{\tau}(Z)}$ is PAC, so the claim follows. ∎
As a consequence, we can use $\pi(\tilde{\mathcal{A}}(x, \hat{\tau}(Z) - \gamma))$ as our PAC prediction set. It remains to describe our design of $\tilde{\mathcal{A}}$.
4.2 Probabilities for ASTs
First, we describe the objective that our choice of $\tilde{\mathcal{A}}$ uses to prioritize partial programs. A standard strategy for ranking candidate prediction sets is to consider the probability mass of labels in the set (Angelopoulos et al., 2020). Then, we can prioritize prediction sets that cover higher probability mass, since they are more likely to contain the true label.
In our setting, the corresponding principle is to prioritize replacing portions of the program that are more uncertain (according to the original scoring function $f$) with holes, since these portions are the most likely to be incorrect compared to the true program. In particular, the corresponding prediction set is constructed by filling holes with all possible subsequences; because low-probability subtrees are the ones where other subsequences have higher probability, replacing these subtrees with holes most increases the aggregate probability mass of the prediction set.
To formalize this notion, we assume that the AST of the top prediction $\bar{y} = \arg\max_y f(x,y)$ has probabilities associated with its leaf nodes. (Footnote 4: Here, we assume access to the AST $T$, which is the case for all major languages, since the ability to construct $T$ is needed to interpret or compile the program $\bar{y}$. In addition, we assume that $\bar{y}$ is valid; in our experiments, we found that invalid code or unparseable output appears in less than 0.1% of cases. When an invalid AST is produced, we can simply take additional samples; alternatively, techniques like PICARD impose constraints during generation to ensure that only valid programs are decoded.) In particular, we assume that each leaf node $v$ in the AST $T$ of $\bar{y}$ is labeled with a value $\ell_v \in \mathbb{R}_{\ge 0}$ denoting the negative log probability of $v$ conditioned on its ancestors $v_1, \dots, v_k$, i.e., $\ell_v = -\log p(v \mid v_1, \dots, v_k)$. (Footnote 5: This probability model assumes that nodes are generated conditioned only on their ancestors. In practice, we often use sequence models that do not respect the structure of $T$, but as a heuristic, we can still use the predicted probability of the token labeling each leaf $v$ to construct $\ell_v$. Because the scoring function can be arbitrary, this heuristic does not invalidate our coverage guarantee.)
$$\ell_T = \sum_{v \in T} \ell_v.$$
Now, if we replace a subtree with a hole, we are effectively enumerating all possible subtrees to fill that hole when constructing the prediction set. Thus, the aggregate negative log probability of that subtree goes from $\sum_{v \in \text{subtree}} \ell_v$ to $0$ (i.e., its probability mass becomes $1$). That is, if we replace a subtree with a hole, we can simply drop those nodes from the sum in $\ell_T$.
Thus, we can label holes $v$ in an AST $T'$ with negative log probability $\ell_v = 0$. Then, we can extend the definition of the negative log probability of a tree to trees with holes:
$$\ell_{T'} = \sum_{v \in T'} \ell_v.$$
We use this objective to prioritize partial programs.
Figure 2: Node removal and code set coverage for a varying number of holes and varying $\epsilon$ values, using the PICARD (Scholak et al., 2021) model evaluated on the SQL dataset (Yu et al., 2018). (a) Fraction of nodes removed as a function of $\epsilon$ for varying bound $m$ on the number of holes. (b) Prediction set coverage rate as a function of $\epsilon$; the coverage always exceeds the desired $1-\epsilon$ rate (dotted line).
Figure 3: Node removal and code set coverage for a varying number of holes and varying $\epsilon$ values, using OpenAI's Codex model (Chen et al., 2021) evaluated on the Python datasets (Hendrycks et al., 2021; Chen et al., 2021). (a) Fraction of nodes removed as a function of $\epsilon$ for varying bound $m$ on the number of holes. (b) Prediction set coverage as a function of $\epsilon$; the coverage always exceeds the desired $1-\epsilon$ rate (dotted line).
4.3 Computing Partial Programs via Optimization
Next, we describe our algorithm $\tilde{\mathcal{A}}$ for constructing a partial program $y = \tilde{\mathcal{A}}(x,\tau)$ representing a prediction set for a given input $x$ and parameter $\tau$. Rather than compute this output for all possible $\tau$, we fix a finite subset of parameters $\tau_1 < \tau_2 < \dots < \tau_k$ and construct partial programs $y_1, y_2, \dots, y_k$ for these values. Then, we take $\tilde{\mathcal{A}}(x,\tau) = y_i$, where $\tau \in (\tau_{i-1}, \tau_i]$. We formulate $\tilde{\mathcal{A}}$ as an integer linear program (ILP), which we describe below.
Optimization variables. Our program includes two sets of variables. First, for each node $v$ in $T$, where $T$ is the AST of $\bar{y}$, and for each parameter $\tau_i$, we include a binary variable $\alpha_{i,v} \in \{0,1\}$ indicating whether we remove the subtree rooted at $v$ from $\bar{y}$ to construct $y_i$. Second, for each node $v$, we include a binary variable $\beta_{i,v} \in \{0,1\}$ indicating whether we remove $v$ from $\bar{y}$ to construct $y_i$. We track subtree removals separately from node removals so we can impose our bound on the number of subtrees removed.
Algorithm 1 Our structured PAC prediction set algorithm.

procedure PredictionSet(scoring function $f$, validation dataset $Z$, confidence levels $\epsilon, \delta$):

- for each $(x, \bar{y}) \in Z$: solve Eqs. 3–9 for $(x, \bar{y})$ to obtain $\alpha, \beta$, and construct each $y_i$ by removing the nodes $v$ such that $\alpha_{i,v} = 1$.
- return $\tau_{i^*}$, where $i^*$ is the largest $i$ such that

$$\sum_{(x,\bar{y}) \in Z} \mathbb{1}\big(\bar{y} \notin \pi(y_i)\big) \le k(\epsilon, \delta, |Z|).$$
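The return step of Algorithm 1 can be sketched as follows. The paper leaves the bound $k(\epsilon, \delta, |Z|)$ abstract (it comes from the existing PAC prediction set algorithm); the sketch instantiates it with a standard binomial-tail bound, which is an assumption of ours rather than a detail stated here:

```python
from math import comb

def k_max(eps, delta, n):
    """One standard instantiation of the bound k(eps, delta, n): the largest
    k with  sum_{j<=k} C(n, j) * eps^j * (1 - eps)^(n - j) <= delta.
    Returns -1 if even zero violations cannot be certified."""
    total, k = 0.0, -1
    for j in range(n + 1):
        total += comb(n, j) * eps**j * (1 - eps) ** (n - j)
        if total <= delta:
            k = j
        else:
            break
    return k

def choose_tau(taus, violations, eps, delta, n):
    """Return step of Algorithm 1: tau_{i*} for the largest i such that the
    number of validation examples with ybar not in pi(y_i) is at most
    k(eps, delta, n), where n = |Z|.  violations[i] is that count for y_i."""
    k = k_max(eps, delta, n)
    chosen = None
    for tau_i, errs in zip(taus, violations):
        if errs <= k:
            chosen = tau_i
    return chosen
```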
Overall program. Our goal is to minimize the number of nodes removed on average across the $y_i$:
$$\min_{\alpha,\beta} \; \sum_{i=1}^{k} \sum_{v \in T} \beta_{i,v} \qquad (3)$$
subject to the following constraints:
$$\sum_{v \in V} \alpha_{i,v} \le m \qquad (\forall i \in [k]) \qquad (4)$$
$$\alpha_{i,v} \to \beta_{i,v} \qquad (\forall v \in V,\; i \in [k]) \qquad (5)$$
$$\beta_{i,v} \to \beta_{i,v'} \qquad (\forall (v,v') \in E) \qquad (6)$$
$$\beta_{i,v} \to \alpha_{i,v} \vee \beta_{i,v'} \qquad (\text{where } (v',v) \in E) \qquad (7)$$
$$\beta_{i,v} \to \beta_{i+1,v} \qquad (\forall v \in V,\; i \in \{1, \dots, k-1\}) \qquad (8)$$
$$\sum_{v \in T} \ell_v \cdot (1 - \beta_{i,v}) \le \tau_i \qquad (\forall i \in [k]) \qquad (9)$$
We have used Boolean notation for clarity; we can enforce $\alpha \to \beta$ for $\alpha, \beta \in \{0,1\}$ via the linear constraint $\alpha \le \beta$. We describe these constraints in detail below. Once we have solved this program, our algorithm directly constructs $y_i$ by removing all subtrees for which $\alpha_{i,v} = 1$.
Removal constraints. We include constraints bounding how many subtrees we remove, and enforcing that if we remove a subtree, we remove all nodes in that subtree:
- Eq. 4: We remove at most $m$ subtrees.
- Eq. 5: If we remove the subtree at $v$, then we remove $v$.
- Eq. 6: If we remove $v$ and $v'$ is a child of $v$, then we also remove $v'$.
- Eq. 7: If we remove $v$, then we either remove the subtree at $v$ or we remove its parent $v'$.
- Eq. 8: If we remove $v$ for $y_i$, then we also remove it for $y_{i+1}$ (i.e., we enforce monotonicity).
Probability mass constraint. Finally, Eq. 9 ensures that we remove sufficient probability mass from $T$ so $\pi(y_i)$ meets the $\tau_i$ threshold; in particular, it requires that the negative log likelihood of the AST $T_i$ of $y_i$ is below $\tau_i$. Note that we use $1-\beta_{i,v}$ since we are summing over nodes remaining in $T_i$. Also, note that $\ell_v$ is a real-valued constant.
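To make the program concrete, the following is a minimal brute-force sketch of the single-threshold version of these constraints. The tree encoding, the `min_removal` name, and the objective of removing as few nodes as possible are our assumptions for illustration; the actual algorithm solves an integer linear program over the $\alpha$ and $\beta$ variables.

```python
from itertools import combinations

def min_removal(children, root, nll, m, tau):
    """Pick at most m subtree roots to delete (Eq. 4) so that the NLL of
    the remaining partial program is at most tau (Eq. 9), removing as few
    nodes as possible. Deleting a root deletes its whole subtree (Eqs. 5-7).
    children: dict node -> list of child nodes; nll: dict node -> NLL ell_v."""
    def subtree(v):
        out = [v]
        for c in children.get(v, []):
            out += subtree(c)
        return out

    all_nodes = subtree(root)
    best = None
    for k in range(m + 1):
        for roots in combinations(all_nodes, k):
            removed = set()
            for r in roots:
                removed |= set(subtree(r))
            # Eq. 9: NLL of the nodes that remain must be at most tau.
            if sum(nll[v] for v in all_nodes if v not in removed) <= tau:
                if best is None or len(removed) < len(best):
                    best = removed
    return best
```

For example, on a toy AST with per-node NLLs, `min_removal({0: [1, 2], 2: [3, 4]}, 0, {0: 0.1, 1: 0.2, 2: 1.0, 3: 2.0, 4: 0.5}, m=1, tau=1.0)` deletes the single high-NLL subtree rooted at node 2.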
4.4 Structured PAC prediction sets.
Given the partial programs $y_i=\tilde{\mathcal{A}}(x,\tau_i)$ computed using $\tilde{\mathcal{A}}$, the remaining question is how we actually choose the parameter $\hat{\tau}(Z)$. We could construct $\tilde{f}$ as described in the proof of Theorem 4.1; then, by Corollary 4.2, $\pi(\tilde{\mathcal{A}}(x,\hat{\tau}(Z)-\gamma))$ is PAC for sufficiently small $\gamma$. (Footnote 6: In fact, we can take $\gamma=\min_i|\tau_i-\tau_{i+1}|$, where the $\tau_i$ are our choices in the design of $\tilde{\mathcal{A}}$, not those in the proof of Theorem 4.1.)
However, constructing $\tilde{f}$ can be computationally intractable. To avoid this issue, note that we do not need to explicitly construct $\tilde{f}$; instead, it suffices for such an $\tilde{f}$ to exist (as long as we can find a way to compute the parameter $\hat{\tau}(Z)$).
To compute $\hat{\tau}(Z)$, it suffices to be able to compute the number of errors in the validation set for a candidate value of $\tau$. Then, we can solve the maximization problem over $\tau$ in (2) by enumerating over the fixed choices of parameters $\tau_i$ used by $\tilde{\mathcal{A}}$; since the other values of $\tau$ produce the same prediction sets, we can ignore them.
Given a validation example $(x,y)\in Z$, computing $\mathbb{1}(y\in F_{\tau_i}(x))$ is straightforward: since $F_{\tau_i}(x)=\pi(y_i)$, we simply check whether $y\in\pi(y_i)$. Doing so is a straightforward tree comparison operation, i.e., we enumerate nodes in $y_i$ (excluding holes) and check if they are all contained in $y$; if so, then $y\in\pi(y_i)$, and if not, then $y\notin\pi(y_i)$.
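This containment check can be sketched on a toy AST representation, using nested `(label, children)` tuples with a `"??"` hole marker; the encoding is ours for illustration, not the paper's.

```python
HOLE = "??"

def in_prediction_set(partial, full):
    """Check whether a concrete AST `full` lies in pi(partial): every
    non-hole node of the partial program must match the corresponding
    node of the full program; holes match any subtree."""
    label, children = partial
    if label == HOLE:
        return True
    full_label, full_children = full
    if label != full_label or len(children) != len(full_children):
        return False
    return all(in_prediction_set(p, f)
               for p, f in zip(children, full_children))
```

For instance, the partial program `("call", [("fib", []), ("??", [])])` matches any call of `fib` regardless of what fills the hole.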
In Algorithm 1, this search is implemented at the end: it returns the largest $\tau_i$ that achieves a given number of errors $k(\epsilon,\delta,n)$ on the validation dataset, where
$$k(\epsilon,\delta,n)=\max_{k\in\mathbb{N}\cup\{0\}}k\quad\text{subj. to}\quad\sum_{i=0}^{k}\binom{n}{i}\epsilon^{i}(1-\epsilon)^{n-i}<\delta$$
is the number of mistakes permitted by an existing PAC prediction set algorithm (Park et al., 2020).
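This quantity is easy to evaluate directly from the binomial tail; a small sketch follows (the `k_max` name is ours; it returns -1 when no number of errors satisfies the bound):

```python
from math import comb

def k_max(eps, delta, n):
    """Largest k with sum_{i=0}^{k} C(n, i) * eps**i * (1 - eps)**(n - i)
    < delta, i.e., the number of validation errors permitted by the PAC
    bound of Park et al. (2020)."""
    total, k = 0.0, -1
    for i in range(n + 1):
        total += comb(n, i) * eps**i * (1 - eps) ** (n - i)
        if total < delta:
            k = i
        else:
            break  # the tail sum is increasing in k, so we can stop
    return k
```

For example, with $\epsilon=0.1$, $\delta=0.01$, and $n=100$ validation examples, the bound permits `k_max(0.1, 0.01, 100) = 3` errors.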
5 Experiments
We evaluate our proposed approach on two tasks: semantic parsing for SQL, and Python code completion.
SQL generation. In this task, the input to the model is a natural language utterance and the target is an SQL program. For the scoring function $f$, we use a state-of-the-art model, PICARD (Scholak et al., 2021), which modifies the decoding process of T5 (Raffel et al., 2019) to constrain decoding to valid SQL programs. This model is trained on the Spider dataset (Yu et al., 2018), a large multi-domain and cross-database dataset for SQL semantic parsing. For our experiments, we use 7,000 examples from Spider to construct prediction sets.
Python generation. In this task, the input to the model is a natural language problem statement combined with some $k$ lines of code, and the target is the remaining lines of code to complete the solution. We use Codex (Chen et al., 2021), a GPT language model fine-tuned on publicly available GitHub code with proficiency in over a dozen programming languages, including Python, JavaScript, Go, TypeScript, and C#. For our experiments, we use natural language to Python code datasets, including APPS (Hendrycks et al., 2021) and HumanEval (Chen et al., 2021).
Baseline. We compare to a baseline strategy that greedily chooses the least likely node to remove at each step. More precisely, it uses the same high-level strategy as our approach, i.e., construct a sequence of partial programs where each one contains the previous, and then use an existing prediction set algorithm to choose $\tau$. The difference is in how the sequence of partial programs is constructed: whereas our approach uses optimization to do so, the baseline uses a greedy strategy. In particular, among all nodes of the current partial program with at least one child missing, it chooses to remove the one that most reduces the negative log likelihood (NLL) of the partial program (normalized by the number of nodes removed).
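A simplified sketch of this greedy baseline, reconstructed from the description above (same toy tree encoding as the optimization sketch in Section 4; per-node NLLs and function names are illustrative, not the paper's code):

```python
def greedy_removal(children, root, nll, tau):
    """Greedy baseline sketch: repeatedly remove the candidate subtree with
    the highest NLL reduction per node removed, until the remaining partial
    program's NLL is at most tau. children: dict node -> list of children;
    nll: dict node -> NLL of that node."""
    def subtree(v):
        out = [v]
        for c in children.get(v, []):
            out += subtree(c)
        return out

    remaining = set(subtree(root))
    removed = set()
    while sum(nll[v] for v in remaining) > tau:
        candidates = [v for v in remaining if v != root]
        if not candidates:
            break

        def score(v):  # NLL removed per node removed
            sub = [u for u in subtree(v) if u in remaining]
            return sum(nll[u] for u in sub) / len(sub)

        best = max(candidates, key=score)
        for u in subtree(best):
            if u in remaining:
                remaining.remove(u)
                removed.add(u)
    return removed
```

On the toy tree `{0: [1, 2], 2: [3, 4]}` with NLLs `{0: 0.1, 1: 0.2, 2: 1.0, 3: 2.0, 4: 0.5}` and `tau = 1.0`, the greedy strategy first removes leaf 3 and then the subtree at node 2.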
(a) Results for PICARD on the Spider dataset.
(b) Results for Codex on the APPS dataset.
Figure 4: Coverage as a function of $\epsilon$ for the top-$k$ predictions by the base model. By construction, the coverage does not depend on $\epsilon$.
Hyperparameters. The main hyperparameter is the choice of search space over $\tau$; we used the simple and natural choice of covering uniformly in increments of 0.01. In general, we found that the results are not overly sensitive to the choice of thresholds, as long as they largely covered the possible choices (with the tradeoff that too few choices can lead to suboptimality, whereas too many can increase conservativeness since the search space is larger). In addition, we choose $\delta=0.01$; we vary $\epsilon$ in our results.
Results. We implemented the structured prediction set algorithm described in Section 4 to construct PAC prediction sets of code, leveraging the abstract syntax trees of the SQL and Python programming languages. The results in Figures 1(a) & 2(a) show the fraction of nodes removed from the predicted AST for varying $\epsilon$ and $m$. The results in Figures 1(b) & 2(b) show the percentage of constructed code sets that include the target program for varying $\epsilon$ and $m$.
For both datasets, our approach constructs prediction sets that remove significantly fewer nodes than the baseline. For instance, with $\epsilon=0.15$, our approach only removes about 45% of nodes for SQL (vs. 55% for the baseline), and about 55% of nodes for Python (vs. 65% for the baseline). Furthermore, our approach achieves better performance while achieving a similar coverage rate (both our approach and the baseline achieve the desired coverage rate in all cases). There is remaining slack compared to the desired coverage rate to account for generalization error; this slack can be reduced by using more samples.
Additionally, we note that the coverage rate decreases as $\epsilon$ increases, demonstrating that our approach is not overly conservative. In particular, the number of nodes removed is due in large part to errors in the underlying language model.
Interestingly, the performance of our algorithm is not very sensitive to $m$, suggesting that $m=1$ typically suffices for constructing prediction sets. Intuitively, this finding suggests that most errors are localized, though more analysis is needed to understand this phenomenon.
Finally, we show the coverage of the top-$k$ predictions of the baseline in Figure 4. As can be seen, the coverage of just the top-1 example is around 70% for PICARD and around 74% for Codex. The coverage can be improved slightly, but not substantially, by increasing $k$. This result justifies our choice of using partial programs to represent prediction sets.
6 Conclusion
In this work, we presented an algorithm for constructing PAC prediction sets for large language models for code generation and semantic parsing. Our approach constructs compact prediction sets in the form of partial programs, leveraging an optimization-based approach to construct a monotonic set of candidate prediction sets and then using an existing algorithm to rigorously select a prediction set threshold. Our experiments demonstrate that our approach constructs valid prediction sets while significantly outperforming a naïve baseline that uses a greedy heuristic instead of optimization. While our approach is tailored to code generation, we anticipate that the general principle of using optimization to construct compact prediction sets can be applied to many structured prediction problems.
Limitations. We briefly summarize several limitations of our approach. First, users are required to specify an error tolerance for our algorithm to be applied. Furthermore, our approach relies heavily on the performance of the underlying generative model, and may exhibit limitations if the base model is not sufficiently accurate. Finally, we have studied only one way of representing prediction sets, and others may be possible. User studies would be needed to compare the effectiveness of different choices.
Acknowledgements
We are grateful to Jeevana Inala for her helpful comments and discussions. This work was supported in part by NSF Award CCF-1910769, NSF Award CCF-1917852, and ARO Award W911NF-20-1-0080.
References
- Abdar et al. (2021) Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi, A., Acharya, U.R., Makarenkov, V., and Nahavandi, S. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243β297, 2021. ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2021.05.008. URL https://www.sciencedirect.com/science/article/pii/S1566253521001081.
- Angelopoulos et al. (2020) Angelopoulos, A., Bates, S., Malik, J., and Jordan, M.I. Uncertainty sets for image classifiers using conformal prediction. arXiv preprint arXiv:2009.14193, 2020.
- Bates et al. (2021) Bates, S., Angelopoulos, A., Lei, L., Malik, J., and Jordan, M.I. Distribution-free, risk-controlling prediction sets, 2021. URL https://arxiv.org/abs/2101.02703.
- Brown et al. (2020) Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.
- Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F.P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W.H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A.N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code, 2021. URL https://arxiv.org/abs/2107.03374.
- Guo et al. (2021) Guo, D., Svyatkovskiy, A., Yin, J., Duan, N., Brockschmidt, M., and Allamanis, M. Learning to complete code with sketches. In International Conference on Learning Representations, 2021.
- Hendrycks et al. (2021) Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. Measuring coding challenge competence with apps. NeurIPS, 2021.
- Iqbal & Qureshi (2022) Iqbal, T. and Qureshi, S. The survey: Text generation models in deep learning. Journal of King Saud University - Computer and Information Sciences, 34(6, Part A):2515β2528, 2022. ISSN 1319-1578. doi: https://doi.org/10.1016/j.jksuci.2020.04.001. URL https://www.sciencedirect.com/science/article/pii/S1319157820303360.
- Liu et al. (2022) Liu, T., Jiang, Y., Monath, N., Cotterell, R., and Sachan, M. Autoregressive structured prediction with language models, 2022. URL https://arxiv.org/abs/2210.14698.
- Park et al. (2020) Park, S., Bastani, O., Matni, N., and Lee, I. Pac confidence sets for deep neural networks via calibrated prediction, 2020. URL https://arxiv.org/abs/2001.00106.
- Raffel et al. (2019) Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019. URL https://arxiv.org/abs/1910.10683.
- Scholak et al. (2021) Scholak, T., Schucher, N., and Bahdanau, D. Picard: Parsing incrementally for constrained auto-regressive decoding from language models, 2021. URL https://arxiv.org/abs/2109.05093.
- Schulz & Behnke (2014) Schulz, H. and Behnke, S. Structured prediction for object detection in deep neural networks. In International Conference on Artificial Neural Networks (ICANN), 2014. URL https://www.ais.uni-bonn.de/papers/icann2014_schulz.pdf.
- Sennrich et al. (2015) Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units, 2015. URL https://arxiv.org/abs/1508.07909.
- Solar-Lezama (2008) Solar-Lezama, A. Program synthesis by sketching. University of California, Berkeley, 2008.
- Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q.V. Sequence to sequence learning with neural networks, 2014. URL https://arxiv.org/abs/1409.3215.
- Tibshirani et al. (2019) Tibshirani, R.J., Foygel Barber, R., Candes, E., and Ramdas, A. Conformal prediction under covariate shift. In Wallach, H., Larochelle, H., Beygelzimer, A., d'AlchΓ©-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/8fb21ee7a2207526da55a679f0332de2-Paper.pdf.
- Vovk et al. (2005) Vovk, V., Gammerman, A., and Shafer, G. Algorithmic Learning in a Random World. Springer-Verlag, Berlin, Heidelberg, 2005. ISBN 0387001522.
- Wilks (1941) Wilks, S.S. Determination of Sample Sizes for Setting Tolerance Limits. The Annals of Mathematical Statistics, 12(1):91 β 96, 1941. doi: 10.1214/aoms/1177731788. URL https://doi.org/10.1214/aoms/1177731788.
- Yu et al. (2018) Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Ma, J., Li, I., Yao, Q., Roman, S., Zhang, Z., and Radev, D. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task, 2018. URL https://arxiv.org/abs/1809.08887.
Appendix A Sample Outputs
SELECT t1.name, t1.capacity FROM stadium AS t1
JOIN concert AS t2 ON t1.stadium_id = t2.stadium_id
WHERE t2.year > 2014
GROUP BY t1.stadium_id
ORDER BY COUNT(*)
DESC LIMIT 1;
(a) Ground truth abstract syntax tree (AST) and SQL query from the Spider dataset (Yu et al., 2018).
SELECT t1.name, t1.capacity FROM stadium AS t1
JOIN concert AS t2 ON t1.stadium_id = t2.stadium_id
WHERE t2.year > 2013
GROUP BY t2.stadium_id
ORDER BY COUNT()
DESC LIMIT 1;
SELECT t1.name, t1.capacity FROM stadium AS t1
JOIN concert AS t2 ON t1.stadium_id = t2.stadium_id
WHERE t2.year > ??
GROUP BY ??
ORDER BY COUNT()
DESC LIMIT 1;
(b) Predicted AST, predicted SQL query, and resulting prediction set for the same task as in 4(a) with $m=2$ holes. Note that the model predicts the full AST depicted above, and our PAC structured prediction set algorithm removes two subtrees (removed nodes shown with an "x").
Figure 5: Example of ground truth SQL query, predicted SQL query, and the constructed prediction set. Note that without the subtree removal in the predicted AST, the resulting query is incorrect. With the two holes in the SQL query, the code prediction set contains the ground truth query.
(a) Ground truth abstract syntax tree (AST) and Python line of code in the APPS dataset (Chen et al., 2021).
return fib(n-0) + fib(n-3)
return fib(??) + fib(n-3)
(b) Predicted AST, predicted Python code, and prediction set for the same task as in 5(a) with $m=1$ hole. Note that the model predicts the full AST depicted above, and our PAC structured prediction set algorithm removes a subtree (removed nodes shown with an "x").
Figure 6: Example of ground truth Python line of code, predicted line of code, and constructed prediction set with a single subtree removal, $m=1$. Note that with a single subtree removal, the resulting code prediction set does not contain the ground truth code.
(a) Ground truth abstract syntax tree (AST) and Python line of code in the APPS dataset (Chen et al., 2021).
return fib(n-0) + fib(n-3)
return fib(n-??) + fib(n-??)
(b) Predicted AST, predicted Python code, and prediction set for the same task as in 6(a) with $m=2$ holes. Note that the model predicts the full AST depicted above, and our PAC structured prediction set algorithm removes two subtrees (removed nodes shown with an "x").
Figure 7: Example of ground truth Python line of code, predicted line of code, and constructed prediction set with two subtrees removed, $m=2$. Note that without the subtree removals in the predicted AST, the resulting code is incorrect. With the two holes in the Python line of code, the code prediction set contains the ground truth code.
return sorted(words, key = lambda x: (-len(set(x)), x))[0]
(a) Ground truth abstract syntax tree (AST) and Python line of code from the APPS dataset (Chen et al., 2021).
return max(words, key=lambda x: len(set(x)))
return ??(words, key=lambda : ??)
(b) Predicted AST and resulting code-prediction set for the same task as in 7(a) with $m=3$ holes. Note that the model predicts the full AST depicted above, and our PAC structured prediction set algorithm removes three subtrees (removed nodes shown with an "x").
Figure 8: Example of ground truth Python line of code, predicted line of code, and constructed prediction set with three subtrees removed, $m=3$. Note that without the subtree removals in the predicted AST, the resulting code is incorrect. With the three holes in the Python line of code, the code prediction set contains the ground truth code.