phanerozoic commited on
Commit
30ce71b
·
verified ·
1 Parent(s): 4d79fd9

Composition boundary: machine-checked factorization/Bayes/JSD theorem + Nucleotide Transformer sweep showing D predicts solvability

Browse files
Files changed (4) hide show
  1. .gitattributes +1 -0
  2. CompositionBoundary.v +260 -0
  3. README.md +28 -0
  4. composition_boundary.png +3 -0
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  manifold.png filter=lfs diff=lfs merge=lfs -text
 
 
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  manifold.png filter=lfs diff=lfs merge=lfs -text
37
+ composition_boundary.png filter=lfs diff=lfs merge=lfs -text
CompositionBoundary.v ADDED
@@ -0,0 +1,260 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ (* ============================================================================
2
+ The composition boundary, machine-checked in Rocq 9.
3
+
4
+ A composition classifier sees a sequence only through its symbol counts, a
5
+ sufficient statistic of an iid model, so model a length-N sequence over V
6
+ symbols as N independent draws from a class-conditional distribution P0 or P1
7
+ (a list of V nonnegative probabilities).
8
+
9
+ The Bhattacharyya coefficient of the two class distributions over sequences,
10
+ BCseq = sum over all length-N sequences s of sqrt(P0(s) * P1(s)), factorizes
11
+ exactly over the independent coordinates:
12
+
13
+ BCseq = rho ^ N, rho = sum_g sqrt(P0_g * P1_g).
14
+
15
+ The Bayes error of the equiprobable two-class test is 1/2 * sum min(P0,P1),
16
+ bounded by 1/2 * BCseq = 1/2 * rho^N. Hence the dichotomy:
17
+ - identical class distributions (P0 = P1, normalized) give rho = 1, so
18
+ BCseq = 1 and the Bayes error is exactly 1/2: composition is at chance;
19
+ - rho < 1 gives BCseq = rho^N, nonincreasing in N and bounded by rho, so
20
+ the classes separate and composition reaches the Bayes rate.
21
+
22
+ Proved over the reals with no axioms and no admits.
23
+ ========================================================================== *)
24
+
25
+ From Stdlib Require Import List Reals Lra Lia.
26
+ Import ListNotations.
27
+ Open Scope R_scope.
28
+
29
+ Definition Rsum : list R -> R := fold_right Rplus 0.
30
+
31
+ (* ---------- list-sum helpers ---------- *)
32
+
33
+ Lemma Rsum_app : forall l1 l2, Rsum (l1 ++ l2) = Rsum l1 + Rsum l2.
34
+ Proof. induction l1; intros; simpl; [lra | rewrite IHl1; lra]. Qed.
35
+
36
+ Lemma Rsum_map_mult_l : forall (A : Type) (c : R) (f : A -> R) (l : list A),
37
+ Rsum (map (fun x => c * f x) l) = c * Rsum (map f l).
38
+ Proof. intros A c f; induction l; simpl; [ring | rewrite IHl; ring]. Qed.
39
+
40
+ Lemma Rsum_map_mult_r : forall (A : Type) (c : R) (f : A -> R) (l : list A),
41
+ Rsum (map (fun x => f x * c) l) = Rsum (map f l) * c.
42
+ Proof. intros A c f; induction l; simpl; [ring | rewrite IHl; ring]. Qed.
43
+
44
+ Lemma Rsum_map_le : forall (A : Type) (f g : A -> R) (l : list A),
45
+ (forall x, In x l -> f x <= g x) -> Rsum (map f l) <= Rsum (map g l).
46
+ Proof.
47
+ intros A f g; induction l as [|x l IH]; intros H; simpl; [lra|].
48
+ apply Rplus_le_compat; [apply H; left; reflexivity | apply IH; intros y Hy; apply H; right; exact Hy].
49
+ Qed.
50
+
51
+ Lemma Rsum_flat_map : forall (A : Type) (F : list nat -> R) (G : A -> list (list nat)) (L : list A),
52
+ Rsum (map F (flat_map G L)) = Rsum (map (fun x => Rsum (map F (G x))) L).
53
+ Proof.
54
+ intros A F G; induction L as [|x L IH]; simpl; [reflexivity|].
55
+ rewrite map_app, Rsum_app, IH; reflexivity.
56
+ Qed.
57
+
58
+ (* ---------- iid model ---------- *)
59
+
60
+ Definition jp (P : list R) (s : list nat) : R :=
61
+ fold_right Rmult 1 (map (fun g => nth g P 0) s).
62
+
63
+ Lemma jp_cons : forall P g s, jp P (g :: s) = nth g P 0 * jp P s.
64
+ Proof. reflexivity. Qed.
65
+
66
+ Fixpoint allseq (V N : nat) : list (list nat) :=
67
+ match N with
68
+ | 0 => [ [] ]
69
+ | S n => flat_map (fun s => map (fun g => g :: s) (seq 0 V)) (allseq V n)
70
+ end.
71
+
72
+ Definition rho1 (P0 P1 : list R) (V : nat) : R :=
73
+ Rsum (map (fun g => sqrt (nth g P0 0 * nth g P1 0)) (seq 0 V)).
74
+
75
+ Definition BCseq (P0 P1 : list R) (V N : nat) : R :=
76
+ Rsum (map (fun s => sqrt (jp P0 s * jp P1 s)) (allseq V N)).
77
+
78
+ (* ---------- nonnegativity ---------- *)
79
+
80
+ Lemma nth_nonneg : forall P g, Forall (Rle 0) P -> 0 <= nth g P 0.
81
+ Proof.
82
+ intros P g H. destruct (Nat.lt_ge_cases g (length P)) as [Hlt | Hge].
83
+ - rewrite Forall_nth in H. apply H. exact Hlt.
84
+ - rewrite nth_overflow by lia. lra.
85
+ Qed.
86
+
87
+ Lemma jp_nonneg : forall P s, Forall (Rle 0) P -> 0 <= jp P s.
88
+ Proof.
89
+ intros P s H. induction s as [|g s IH].
90
+ - unfold jp; simpl; lra.
91
+ - rewrite jp_cons. apply Rmult_le_pos; [apply nth_nonneg; exact H | exact IH].
92
+ Qed.
93
+
94
+ (* ---------- per-coordinate sums ---------- *)
95
+
96
+ Lemma inner_sum : forall P0 P1, Forall (Rle 0) P0 -> Forall (Rle 0) P1 ->
97
+ forall V s,
98
+ Rsum (map (fun s' => sqrt (jp P0 s' * jp P1 s')) (map (fun g => g :: s) (seq 0 V)))
99
+ = rho1 P0 P1 V * sqrt (jp P0 s * jp P1 s).
100
+ Proof.
101
+ intros P0 P1 H0 H1 V s. rewrite map_map. cbv beta.
102
+ assert (BODY : forall g, sqrt (jp P0 (g :: s) * jp P1 (g :: s))
103
+ = sqrt (nth g P0 0 * nth g P1 0) * sqrt (jp P0 s * jp P1 s)).
104
+ { intro g. rewrite !jp_cons.
105
+ replace (nth g P0 0 * jp P0 s * (nth g P1 0 * jp P1 s))
106
+ with ((nth g P0 0 * nth g P1 0) * (jp P0 s * jp P1 s)) by ring.
107
+ rewrite sqrt_mult.
108
+ - reflexivity.
109
+ - apply Rmult_le_pos; apply nth_nonneg; assumption.
110
+ - apply Rmult_le_pos; apply jp_nonneg; assumption. }
111
+ rewrite (map_ext _ _ BODY).
112
+ rewrite (Rsum_map_mult_r _ (sqrt (jp P0 s * jp P1 s)) (fun g => sqrt (nth g P0 0 * nth g P1 0)) (seq 0 V)).
113
+ unfold rho1. reflexivity.
114
+ Qed.
115
+
116
+ Lemma inner_total : forall P V s,
117
+ Rsum (map (fun s' => jp P s') (map (fun g => g :: s) (seq 0 V)))
118
+ = Rsum (map (fun g => nth g P 0) (seq 0 V)) * jp P s.
119
+ Proof.
120
+ intros P V s. rewrite map_map. cbv beta.
121
+ assert (BODY : forall g, jp P (g :: s) = nth g P 0 * jp P s) by (intro g; apply jp_cons).
122
+ rewrite (map_ext _ _ BODY).
123
+ rewrite (Rsum_map_mult_r _ (jp P s) (fun g => nth g P 0) (seq 0 V)). reflexivity.
124
+ Qed.
125
+
126
+ (* ---------- the factorization: BCseq = rho ^ N ---------- *)
127
+
128
+ Theorem BC_factorizes : forall P0 P1 V N,
129
+ Forall (Rle 0) P0 -> Forall (Rle 0) P1 ->
130
+ BCseq P0 P1 V N = (rho1 P0 P1 V) ^ N.
131
+ Proof.
132
+ intros P0 P1 V N H0 H1. induction N as [|N IH].
133
+ - unfold BCseq. simpl. unfold jp; simpl. rewrite Rmult_1_r, sqrt_1. simpl; lra.
134
+ - unfold BCseq.
135
+ replace (allseq V (S N)) with (flat_map (fun s => map (fun g => g :: s) (seq 0 V)) (allseq V N)) by reflexivity.
136
+ rewrite Rsum_flat_map. cbv beta.
137
+ rewrite (map_ext _ _ (inner_sum P0 P1 H0 H1 V)).
138
+ rewrite (Rsum_map_mult_l _ (rho1 P0 P1 V) (fun s => sqrt (jp P0 s * jp P1 s)) (allseq V N)).
139
+ change (Rsum (map (fun s => sqrt (jp P0 s * jp P1 s)) (allseq V N))) with (BCseq P0 P1 V N).
140
+ rewrite IH. reflexivity.
141
+ Qed.
142
+
143
+ Theorem total_prob : forall P V N,
144
+ Rsum (map (fun s => jp P s) (allseq V N)) = (Rsum (map (fun g => nth g P 0) (seq 0 V))) ^ N.
145
+ Proof.
146
+ intros P V N. induction N as [|N IH].
147
+ - simpl. unfold jp; simpl. lra.
148
+ - replace (allseq V (S N)) with (flat_map (fun s => map (fun g => g :: s) (seq 0 V)) (allseq V N)) by reflexivity.
149
+ rewrite Rsum_flat_map. cbv beta.
150
+ rewrite (map_ext _ _ (inner_total P V)).
151
+ rewrite (Rsum_map_mult_l _ (Rsum (map (fun g => nth g P 0) (seq 0 V))) (fun s => jp P s) (allseq V N)).
152
+ rewrite IH. reflexivity.
153
+ Qed.
154
+
155
+ (* ---------- Bayes error ---------- *)
156
+
157
+ Definition pe (P0 P1 : list R) (V N : nat) : R :=
158
+ (1/2) * Rsum (map (fun s => Rmin (jp P0 s) (jp P1 s)) (allseq V N)).
159
+
160
+ Lemma Rmin_le_sqrt_mult : forall a b, 0 <= a -> 0 <= b -> Rmin a b <= sqrt (a * b).
161
+ Proof.
162
+ intros a b Ha Hb. apply Rsqr_incr_0.
163
+ - unfold Rsqr. rewrite sqrt_sqrt by (apply Rmult_le_pos; assumption).
164
+ destruct (Rle_dec a b) as [H|H].
165
+ + rewrite Rmin_left by exact H. apply Rmult_le_compat_l; [exact Ha | exact H].
166
+ + rewrite Rmin_right by lra. rewrite (Rmult_comm a b).
167
+ apply Rmult_le_compat_l; [exact Hb | lra].
168
+ - apply Rmin_glb; assumption.
169
+ - apply sqrt_pos.
170
+ Qed.
171
+
172
+ Theorem bayes_bound : forall P0 P1 V N,
173
+ Forall (Rle 0) P0 -> Forall (Rle 0) P1 ->
174
+ pe P0 P1 V N <= (1/2) * (rho1 P0 P1 V) ^ N.
175
+ Proof.
176
+ intros P0 P1 V N H0 H1. unfold pe.
177
+ rewrite <- BC_factorizes by assumption. unfold BCseq.
178
+ apply Rmult_le_compat_l; [lra|].
179
+ apply Rsum_map_le. intros s _. apply Rmin_le_sqrt_mult; apply jp_nonneg; assumption.
180
+ Qed.
181
+
182
+ (* ---------- dichotomy ---------- *)
183
+
184
+ Theorem degenerate_is_chance : forall P V N,
185
+ Forall (Rle 0) P ->
186
+ Rsum (map (fun g => nth g P 0) (seq 0 V)) = 1 ->
187
+ rho1 P P V = 1 /\ pe P P V N = 1/2.
188
+ Proof.
189
+ intros P V N HP Hnorm. split.
190
+ - unfold rho1.
191
+ assert (B : forall g, sqrt (nth g P 0 * nth g P 0) = nth g P 0).
192
+ { intro g. replace (nth g P 0 * nth g P 0) with (Rsqr (nth g P 0)) by (unfold Rsqr; ring).
193
+ apply sqrt_Rsqr. apply nth_nonneg; exact HP. }
194
+ rewrite (map_ext _ _ B). exact Hnorm.
195
+ - unfold pe.
196
+ assert (B : forall s, Rmin (jp P s) (jp P s) = jp P s) by (intro s; apply Rmin_left; apply Rle_refl).
197
+ rewrite (map_ext _ _ B). rewrite total_prob, Hnorm, pow1. lra.
198
+ Qed.
199
+
200
+ Lemma pow_le_one : forall r n, 0 <= r -> r <= 1 -> r ^ n <= 1.
201
+ Proof.
202
+ intros r n Hr0 Hr1. induction n as [|n IH]; simpl; [lra|].
203
+ apply Rle_trans with (r * 1); [apply Rmult_le_compat_l; [exact Hr0 | exact IH] | lra].
204
+ Qed.
205
+
206
+ Theorem separable_shrinks : forall r N,
207
+ 0 <= r -> r < 1 -> r ^ (S N) <= r ^ N /\ r ^ (S N) <= r.
208
+ Proof.
209
+ intros r N Hr0 Hr1. split; simpl.
210
+ - apply Rle_trans with (1 * r ^ N); [apply Rmult_le_compat_r; [apply pow_le; exact Hr0 | lra] | lra].
211
+ - apply Rle_trans with (r * 1); [apply Rmult_le_compat_l; [exact Hr0 | apply pow_le_one; lra] | lra].
212
+ Qed.
213
+
214
+ (* ============================================================================
215
+ The Jensen-Shannon identity tying the boundary to mutual information.
216
+
217
+ For a balanced binary label Y with X|Y distributed as P or Q, the mutual
218
+ information I(Y;X) equals 1/2 KL(P||M) + 1/2 KL(Q||M) with M = (P+Q)/2, the
219
+ information form of the Jensen-Shannon divergence. That equals the entropy
220
+ form H(M) - 1/2 H(P) - 1/2 H(Q). Below, jsd_info is the information form
221
+ summed coordinatewise and jsd_entropy is the entropy form; they coincide on
222
+ strictly positive distributions. Since X may be the k-mer count vector, this
223
+ identifies the boundary's information fraction with a mutual information.
224
+ ========================================================================== *)
225
+
226
+ Lemma lndiv : forall x y, 0 < x -> 0 < y -> ln (x / y) = ln x - ln y.
227
+ Proof.
228
+ intros x y Hx Hy. unfold Rdiv.
229
+ rewrite ln_mult by (first [exact Hx | apply Rinv_0_lt_compat; exact Hy]).
230
+ rewrite ln_Rinv by exact Hy. ring.
231
+ Qed.
232
+
233
+ Fixpoint jsd_info (P Q : list R) : R :=
234
+ match P, Q with
235
+ | p :: P', q :: Q' =>
236
+ (1/2) * (p * ln (p / ((p + q) / 2))) + (1/2) * (q * ln (q / ((p + q) / 2))) + jsd_info P' Q'
237
+ | _, _ => 0
238
+ end.
239
+
240
+ Fixpoint jsd_entropy (P Q : list R) : R :=
241
+ match P, Q with
242
+ | p :: P', q :: Q' =>
243
+ - (((p + q) / 2) * ln ((p + q) / 2)) + (1/2) * (p * ln p) + (1/2) * (q * ln q) + jsd_entropy P' Q'
244
+ | _, _ => 0
245
+ end.
246
+
247
+ Theorem jsd_info_eq_entropy : forall P Q,
248
+ Forall (Rlt 0) P -> Forall (Rlt 0) Q -> length P = length Q ->
249
+ jsd_info P Q = jsd_entropy P Q.
250
+ Proof.
251
+ intros P. induction P as [|p P IH]; intros Q HP HQ Hlen;
252
+ destruct Q as [|q Q]; simpl in *; try reflexivity; try discriminate.
253
+ inversion HP as [|x xs Hp HP' Heq0]; subst.
254
+ inversion HQ as [|x xs Hq HQ' Heq1]; subst.
255
+ assert (Hlen' : length P = length Q) by lia.
256
+ rewrite (lndiv p ((p + q) / 2)) by lra.
257
+ rewrite (lndiv q ((p + q) / 2)) by lra.
258
+ rewrite (IH Q HP' HQ' Hlen').
259
+ field.
260
+ Qed.
README.md CHANGED
@@ -226,6 +226,34 @@ language model resists it) and `TOOL.md` (footprint, throughput, read-length, op
226
  coordinate-ascent designs (stars) land far outside it, where the host head saturates but no biology
227
  lives.*
228
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
229
  ## Screening and hybrid use
230
 
231
  - **Certified screening frontier.** To pass an engineered construct as natural an adversary must
 
226
  coordinate-ascent designs (stars) land far outside it, where the host head saturates but no biology
227
  lives.*
228
 
229
+ ## The composition boundary
230
+
231
+ The partition has a boundary you can compute before training and a proof that it holds. View a
232
+ length-L sequence as N = L - k + 1 draws from a class-conditional k-mer distribution, which is all a
233
+ composition classifier sees. The Bhattacharyya coefficient between the two classes' count
234
+ distributions factorizes exactly to rho^N, with rho the per-k-mer overlap; the Bayes error of any
235
+ count-based classifier is at most half of rho^N; and for a balanced label the mutual information
236
+ equals the Jensen-Shannon divergence between the count distributions. Solvability is therefore set by
237
+ D = -N ln rho, read off the class distributions with no classifier: equal distributions give rho = 1
238
+ and chance, rho < 1 gives an error that vanishes with length. These four statements are machine
239
+ checked in `CompositionBoundary.v` (Rocq, over Coq's standard reals, no `admit`).
240
+
241
+ ![The composition boundary on a foundation-model benchmark](composition_boundary.png)
242
+
243
+ Across the sixteen binary Nucleotide Transformer tasks, D predicts task by task whether composition
244
+ suffices. On the high-D tasks the 2 MB linear k-mer model is already at the ceiling, promoters at
245
+ 0.96 and the H3 and H4 marks at 0.84 to 0.88, within one to two AUROC points of a CNN trained on the
246
+ same data, so no model has room to add much. The learned model's advantage sits on the low-D tasks
247
+ the criterion names in advance, splice donor and acceptor sites most clearly, where the discriminating
248
+ motif is positional and falls beneath the composition a bag of k-mers reads. The protein side matches:
249
+ amino-acid composition separates eukaryote from bacterium above the boundary, while the
250
+ substitution-effect signal behind variant pathogenicity sits below it; within-protein fitness assays
251
+ such as ProteinGym are mutant ensembles rather than composition samples, outside this model, and they
252
+ fall in the same deep regime. The consequence is a re-scoring rather than a new tool: a
253
+ sequence-classification benchmark can be ordered by D with no training, its above-boundary entries are
254
+ k-mer baselines a 2 MB model matches, and a foundation model's distinct contribution is whatever it
255
+ adds on the entries below.
256
+
257
  ## Screening and hybrid use
258
 
259
  - **Certified screening frontier.** To pass an engineered construct as natural an adversary must
composition_boundary.png ADDED

Git LFS Details

  • SHA256: 0cc933845f4be634c81e1d6099a68c0f97220cd55fd8dcf49bda556a14075343
  • Pointer size: 131 Bytes
  • Size of remote file: 103 kB