Composition boundary: machine-checked factorization/Bayes/JSD theorem + Nucleotide Transformer sweep showing D predicts solvability
Browse files- .gitattributes +1 -0
- CompositionBoundary.v +260 -0
- README.md +28 -0
- composition_boundary.png +3 -0
.gitattributes
CHANGED
|
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
manifold.png filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
manifold.png filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
composition_boundary.png filter=lfs diff=lfs merge=lfs -text
|
CompositionBoundary.v
ADDED
|
@@ -0,0 +1,260 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
(* ============================================================================
|
| 2 |
+
The composition boundary, machine-checked in Rocq 9.
|
| 3 |
+
|
| 4 |
+
A composition classifier sees a sequence only through its symbol counts, a
|
| 5 |
+
sufficient statistic of an iid model, so model a length-N sequence over V
|
| 6 |
+
symbols as N independent draws from a class-conditional distribution P0 or P1
|
| 7 |
+
(a list of V nonnegative probabilities).
|
| 8 |
+
|
| 9 |
+
The Bhattacharyya coefficient of the two class distributions over sequences,
|
| 10 |
+
BCseq = sum over all length-N sequences s of sqrt(P0(s) * P1(s)), factorizes
|
| 11 |
+
exactly over the independent coordinates:
|
| 12 |
+
|
| 13 |
+
BCseq = rho ^ N, rho = sum_g sqrt(P0_g * P1_g).
|
| 14 |
+
|
| 15 |
+
The Bayes error of the equiprobable two-class test is 1/2 * sum min(P0,P1),
|
| 16 |
+
bounded by 1/2 * BCseq = 1/2 * rho^N. Hence the dichotomy:
|
| 17 |
+
- identical class distributions (P0 = P1, normalized) give rho = 1, so
|
| 18 |
+
BCseq = 1 and the Bayes error is exactly 1/2: composition is at chance;
|
| 19 |
+
- rho < 1 gives BCseq = rho^N, nonincreasing in N and bounded by rho, so
|
| 20 |
+
the classes separate and composition reaches the Bayes rate.
|
| 21 |
+
|
| 22 |
+
Proved over the reals with no axioms and no admits.
|
| 23 |
+
========================================================================== *)
|
| 24 |
+
|
| 25 |
+
From Stdlib Require Import List Reals Lra Lia.
|
| 26 |
+
Import ListNotations.
|
| 27 |
+
Open Scope R_scope.
|
| 28 |
+
|
| 29 |
+
Definition Rsum : list R -> R := fold_right Rplus 0.
|
| 30 |
+
|
| 31 |
+
(* ---------- list-sum helpers ---------- *)
|
| 32 |
+
|
| 33 |
+
Lemma Rsum_app : forall l1 l2, Rsum (l1 ++ l2) = Rsum l1 + Rsum l2.
|
| 34 |
+
Proof. induction l1; intros; simpl; [lra | rewrite IHl1; lra]. Qed.
|
| 35 |
+
|
| 36 |
+
Lemma Rsum_map_mult_l : forall (A : Type) (c : R) (f : A -> R) (l : list A),
|
| 37 |
+
Rsum (map (fun x => c * f x) l) = c * Rsum (map f l).
|
| 38 |
+
Proof. intros A c f; induction l; simpl; [ring | rewrite IHl; ring]. Qed.
|
| 39 |
+
|
| 40 |
+
Lemma Rsum_map_mult_r : forall (A : Type) (c : R) (f : A -> R) (l : list A),
|
| 41 |
+
Rsum (map (fun x => f x * c) l) = Rsum (map f l) * c.
|
| 42 |
+
Proof. intros A c f; induction l; simpl; [ring | rewrite IHl; ring]. Qed.
|
| 43 |
+
|
| 44 |
+
Lemma Rsum_map_le : forall (A : Type) (f g : A -> R) (l : list A),
|
| 45 |
+
(forall x, In x l -> f x <= g x) -> Rsum (map f l) <= Rsum (map g l).
|
| 46 |
+
Proof.
|
| 47 |
+
intros A f g; induction l as [|x l IH]; intros H; simpl; [lra|].
|
| 48 |
+
apply Rplus_le_compat; [apply H; left; reflexivity | apply IH; intros y Hy; apply H; right; exact Hy].
|
| 49 |
+
Qed.
|
| 50 |
+
|
| 51 |
+
Lemma Rsum_flat_map : forall (A : Type) (F : list nat -> R) (G : A -> list (list nat)) (L : list A),
|
| 52 |
+
Rsum (map F (flat_map G L)) = Rsum (map (fun x => Rsum (map F (G x))) L).
|
| 53 |
+
Proof.
|
| 54 |
+
intros A F G; induction L as [|x L IH]; simpl; [reflexivity|].
|
| 55 |
+
rewrite map_app, Rsum_app, IH; reflexivity.
|
| 56 |
+
Qed.
|
| 57 |
+
|
| 58 |
+
(* ---------- iid model ---------- *)
|
| 59 |
+
|
| 60 |
+
Definition jp (P : list R) (s : list nat) : R :=
|
| 61 |
+
fold_right Rmult 1 (map (fun g => nth g P 0) s).
|
| 62 |
+
|
| 63 |
+
Lemma jp_cons : forall P g s, jp P (g :: s) = nth g P 0 * jp P s.
|
| 64 |
+
Proof. reflexivity. Qed.
|
| 65 |
+
|
| 66 |
+
Fixpoint allseq (V N : nat) : list (list nat) :=
|
| 67 |
+
match N with
|
| 68 |
+
| 0 => [ [] ]
|
| 69 |
+
| S n => flat_map (fun s => map (fun g => g :: s) (seq 0 V)) (allseq V n)
|
| 70 |
+
end.
|
| 71 |
+
|
| 72 |
+
Definition rho1 (P0 P1 : list R) (V : nat) : R :=
|
| 73 |
+
Rsum (map (fun g => sqrt (nth g P0 0 * nth g P1 0)) (seq 0 V)).
|
| 74 |
+
|
| 75 |
+
Definition BCseq (P0 P1 : list R) (V N : nat) : R :=
|
| 76 |
+
Rsum (map (fun s => sqrt (jp P0 s * jp P1 s)) (allseq V N)).
|
| 77 |
+
|
| 78 |
+
(* ---------- nonnegativity ---------- *)
|
| 79 |
+
|
| 80 |
+
Lemma nth_nonneg : forall P g, Forall (Rle 0) P -> 0 <= nth g P 0.
|
| 81 |
+
Proof.
|
| 82 |
+
intros P g H. destruct (Nat.lt_ge_cases g (length P)) as [Hlt | Hge].
|
| 83 |
+
- rewrite Forall_nth in H. apply H. exact Hlt.
|
| 84 |
+
- rewrite nth_overflow by lia. lra.
|
| 85 |
+
Qed.
|
| 86 |
+
|
| 87 |
+
Lemma jp_nonneg : forall P s, Forall (Rle 0) P -> 0 <= jp P s.
|
| 88 |
+
Proof.
|
| 89 |
+
intros P s H. induction s as [|g s IH].
|
| 90 |
+
- unfold jp; simpl; lra.
|
| 91 |
+
- rewrite jp_cons. apply Rmult_le_pos; [apply nth_nonneg; exact H | exact IH].
|
| 92 |
+
Qed.
|
| 93 |
+
|
| 94 |
+
(* ---------- per-coordinate sums ---------- *)
|
| 95 |
+
|
| 96 |
+
Lemma inner_sum : forall P0 P1, Forall (Rle 0) P0 -> Forall (Rle 0) P1 ->
|
| 97 |
+
forall V s,
|
| 98 |
+
Rsum (map (fun s' => sqrt (jp P0 s' * jp P1 s')) (map (fun g => g :: s) (seq 0 V)))
|
| 99 |
+
= rho1 P0 P1 V * sqrt (jp P0 s * jp P1 s).
|
| 100 |
+
Proof.
|
| 101 |
+
intros P0 P1 H0 H1 V s. rewrite map_map. cbv beta.
|
| 102 |
+
assert (BODY : forall g, sqrt (jp P0 (g :: s) * jp P1 (g :: s))
|
| 103 |
+
= sqrt (nth g P0 0 * nth g P1 0) * sqrt (jp P0 s * jp P1 s)).
|
| 104 |
+
{ intro g. rewrite !jp_cons.
|
| 105 |
+
replace (nth g P0 0 * jp P0 s * (nth g P1 0 * jp P1 s))
|
| 106 |
+
with ((nth g P0 0 * nth g P1 0) * (jp P0 s * jp P1 s)) by ring.
|
| 107 |
+
rewrite sqrt_mult.
|
| 108 |
+
- reflexivity.
|
| 109 |
+
- apply Rmult_le_pos; apply nth_nonneg; assumption.
|
| 110 |
+
- apply Rmult_le_pos; apply jp_nonneg; assumption. }
|
| 111 |
+
rewrite (map_ext _ _ BODY).
|
| 112 |
+
rewrite (Rsum_map_mult_r _ (sqrt (jp P0 s * jp P1 s)) (fun g => sqrt (nth g P0 0 * nth g P1 0)) (seq 0 V)).
|
| 113 |
+
unfold rho1. reflexivity.
|
| 114 |
+
Qed.
|
| 115 |
+
|
| 116 |
+
Lemma inner_total : forall P V s,
|
| 117 |
+
Rsum (map (fun s' => jp P s') (map (fun g => g :: s) (seq 0 V)))
|
| 118 |
+
= Rsum (map (fun g => nth g P 0) (seq 0 V)) * jp P s.
|
| 119 |
+
Proof.
|
| 120 |
+
intros P V s. rewrite map_map. cbv beta.
|
| 121 |
+
assert (BODY : forall g, jp P (g :: s) = nth g P 0 * jp P s) by (intro g; apply jp_cons).
|
| 122 |
+
rewrite (map_ext _ _ BODY).
|
| 123 |
+
rewrite (Rsum_map_mult_r _ (jp P s) (fun g => nth g P 0) (seq 0 V)). reflexivity.
|
| 124 |
+
Qed.
|
| 125 |
+
|
| 126 |
+
(* ---------- the factorization: BCseq = rho ^ N ---------- *)
|
| 127 |
+
|
| 128 |
+
Theorem BC_factorizes : forall P0 P1 V N,
|
| 129 |
+
Forall (Rle 0) P0 -> Forall (Rle 0) P1 ->
|
| 130 |
+
BCseq P0 P1 V N = (rho1 P0 P1 V) ^ N.
|
| 131 |
+
Proof.
|
| 132 |
+
intros P0 P1 V N H0 H1. induction N as [|N IH].
|
| 133 |
+
- unfold BCseq. simpl. unfold jp; simpl. rewrite Rmult_1_r, sqrt_1. simpl; lra.
|
| 134 |
+
- unfold BCseq.
|
| 135 |
+
replace (allseq V (S N)) with (flat_map (fun s => map (fun g => g :: s) (seq 0 V)) (allseq V N)) by reflexivity.
|
| 136 |
+
rewrite Rsum_flat_map. cbv beta.
|
| 137 |
+
rewrite (map_ext _ _ (inner_sum P0 P1 H0 H1 V)).
|
| 138 |
+
rewrite (Rsum_map_mult_l _ (rho1 P0 P1 V) (fun s => sqrt (jp P0 s * jp P1 s)) (allseq V N)).
|
| 139 |
+
change (Rsum (map (fun s => sqrt (jp P0 s * jp P1 s)) (allseq V N))) with (BCseq P0 P1 V N).
|
| 140 |
+
rewrite IH. reflexivity.
|
| 141 |
+
Qed.
|
| 142 |
+
|
| 143 |
+
Theorem total_prob : forall P V N,
|
| 144 |
+
Rsum (map (fun s => jp P s) (allseq V N)) = (Rsum (map (fun g => nth g P 0) (seq 0 V))) ^ N.
|
| 145 |
+
Proof.
|
| 146 |
+
intros P V N. induction N as [|N IH].
|
| 147 |
+
- simpl. unfold jp; simpl. lra.
|
| 148 |
+
- replace (allseq V (S N)) with (flat_map (fun s => map (fun g => g :: s) (seq 0 V)) (allseq V N)) by reflexivity.
|
| 149 |
+
rewrite Rsum_flat_map. cbv beta.
|
| 150 |
+
rewrite (map_ext _ _ (inner_total P V)).
|
| 151 |
+
rewrite (Rsum_map_mult_l _ (Rsum (map (fun g => nth g P 0) (seq 0 V))) (fun s => jp P s) (allseq V N)).
|
| 152 |
+
rewrite IH. reflexivity.
|
| 153 |
+
Qed.
|
| 154 |
+
|
| 155 |
+
(* ---------- Bayes error ---------- *)
|
| 156 |
+
|
| 157 |
+
Definition pe (P0 P1 : list R) (V N : nat) : R :=
|
| 158 |
+
(1/2) * Rsum (map (fun s => Rmin (jp P0 s) (jp P1 s)) (allseq V N)).
|
| 159 |
+
|
| 160 |
+
Lemma Rmin_le_sqrt_mult : forall a b, 0 <= a -> 0 <= b -> Rmin a b <= sqrt (a * b).
|
| 161 |
+
Proof.
|
| 162 |
+
intros a b Ha Hb. apply Rsqr_incr_0.
|
| 163 |
+
- unfold Rsqr. rewrite sqrt_sqrt by (apply Rmult_le_pos; assumption).
|
| 164 |
+
destruct (Rle_dec a b) as [H|H].
|
| 165 |
+
+ rewrite Rmin_left by exact H. apply Rmult_le_compat_l; [exact Ha | exact H].
|
| 166 |
+
+ rewrite Rmin_right by lra. rewrite (Rmult_comm a b).
|
| 167 |
+
apply Rmult_le_compat_l; [exact Hb | lra].
|
| 168 |
+
- apply Rmin_glb; assumption.
|
| 169 |
+
- apply sqrt_pos.
|
| 170 |
+
Qed.
|
| 171 |
+
|
| 172 |
+
Theorem bayes_bound : forall P0 P1 V N,
|
| 173 |
+
Forall (Rle 0) P0 -> Forall (Rle 0) P1 ->
|
| 174 |
+
pe P0 P1 V N <= (1/2) * (rho1 P0 P1 V) ^ N.
|
| 175 |
+
Proof.
|
| 176 |
+
intros P0 P1 V N H0 H1. unfold pe.
|
| 177 |
+
rewrite <- BC_factorizes by assumption. unfold BCseq.
|
| 178 |
+
apply Rmult_le_compat_l; [lra|].
|
| 179 |
+
apply Rsum_map_le. intros s _. apply Rmin_le_sqrt_mult; apply jp_nonneg; assumption.
|
| 180 |
+
Qed.
|
| 181 |
+
|
| 182 |
+
(* ---------- dichotomy ---------- *)
|
| 183 |
+
|
| 184 |
+
Theorem degenerate_is_chance : forall P V N,
|
| 185 |
+
Forall (Rle 0) P ->
|
| 186 |
+
Rsum (map (fun g => nth g P 0) (seq 0 V)) = 1 ->
|
| 187 |
+
rho1 P P V = 1 /\ pe P P V N = 1/2.
|
| 188 |
+
Proof.
|
| 189 |
+
intros P V N HP Hnorm. split.
|
| 190 |
+
- unfold rho1.
|
| 191 |
+
assert (B : forall g, sqrt (nth g P 0 * nth g P 0) = nth g P 0).
|
| 192 |
+
{ intro g. replace (nth g P 0 * nth g P 0) with (Rsqr (nth g P 0)) by (unfold Rsqr; ring).
|
| 193 |
+
apply sqrt_Rsqr. apply nth_nonneg; exact HP. }
|
| 194 |
+
rewrite (map_ext _ _ B). exact Hnorm.
|
| 195 |
+
- unfold pe.
|
| 196 |
+
assert (B : forall s, Rmin (jp P s) (jp P s) = jp P s) by (intro s; apply Rmin_left; apply Rle_refl).
|
| 197 |
+
rewrite (map_ext _ _ B). rewrite total_prob, Hnorm, pow1. lra.
|
| 198 |
+
Qed.
|
| 199 |
+
|
| 200 |
+
Lemma pow_le_one : forall r n, 0 <= r -> r <= 1 -> r ^ n <= 1.
|
| 201 |
+
Proof.
|
| 202 |
+
intros r n Hr0 Hr1. induction n as [|n IH]; simpl; [lra|].
|
| 203 |
+
apply Rle_trans with (r * 1); [apply Rmult_le_compat_l; [exact Hr0 | exact IH] | lra].
|
| 204 |
+
Qed.
|
| 205 |
+
|
| 206 |
+
Theorem separable_shrinks : forall r N,
|
| 207 |
+
0 <= r -> r < 1 -> r ^ (S N) <= r ^ N /\ r ^ (S N) <= r.
|
| 208 |
+
Proof.
|
| 209 |
+
intros r N Hr0 Hr1. split; simpl.
|
| 210 |
+
- apply Rle_trans with (1 * r ^ N); [apply Rmult_le_compat_r; [apply pow_le; exact Hr0 | lra] | lra].
|
| 211 |
+
- apply Rle_trans with (r * 1); [apply Rmult_le_compat_l; [exact Hr0 | apply pow_le_one; lra] | lra].
|
| 212 |
+
Qed.
|
| 213 |
+
|
| 214 |
+
(* ============================================================================
|
| 215 |
+
The Jensen-Shannon identity tying the boundary to mutual information.
|
| 216 |
+
|
| 217 |
+
For a balanced binary label Y with X|Y distributed as P or Q, the mutual
|
| 218 |
+
information I(Y;X) equals 1/2 KL(P||M) + 1/2 KL(Q||M) with M = (P+Q)/2, the
|
| 219 |
+
information form of the Jensen-Shannon divergence. That equals the entropy
|
| 220 |
+
form H(M) - 1/2 H(P) - 1/2 H(Q). Below, jsd_info is the information form
|
| 221 |
+
summed coordinatewise and jsd_entropy is the entropy form; they coincide on
|
| 222 |
+
strictly positive distributions. Since X may be the k-mer count vector, this
|
| 223 |
+
identifies the boundary's information fraction with a mutual information.
|
| 224 |
+
========================================================================== *)
|
| 225 |
+
|
| 226 |
+
Lemma lndiv : forall x y, 0 < x -> 0 < y -> ln (x / y) = ln x - ln y.
|
| 227 |
+
Proof.
|
| 228 |
+
intros x y Hx Hy. unfold Rdiv.
|
| 229 |
+
rewrite ln_mult by (first [exact Hx | apply Rinv_0_lt_compat; exact Hy]).
|
| 230 |
+
rewrite ln_Rinv by exact Hy. ring.
|
| 231 |
+
Qed.
|
| 232 |
+
|
| 233 |
+
Fixpoint jsd_info (P Q : list R) : R :=
|
| 234 |
+
match P, Q with
|
| 235 |
+
| p :: P', q :: Q' =>
|
| 236 |
+
(1/2) * (p * ln (p / ((p + q) / 2))) + (1/2) * (q * ln (q / ((p + q) / 2))) + jsd_info P' Q'
|
| 237 |
+
| _, _ => 0
|
| 238 |
+
end.
|
| 239 |
+
|
| 240 |
+
Fixpoint jsd_entropy (P Q : list R) : R :=
|
| 241 |
+
match P, Q with
|
| 242 |
+
| p :: P', q :: Q' =>
|
| 243 |
+
- (((p + q) / 2) * ln ((p + q) / 2)) + (1/2) * (p * ln p) + (1/2) * (q * ln q) + jsd_entropy P' Q'
|
| 244 |
+
| _, _ => 0
|
| 245 |
+
end.
|
| 246 |
+
|
| 247 |
+
Theorem jsd_info_eq_entropy : forall P Q,
|
| 248 |
+
Forall (Rlt 0) P -> Forall (Rlt 0) Q -> length P = length Q ->
|
| 249 |
+
jsd_info P Q = jsd_entropy P Q.
|
| 250 |
+
Proof.
|
| 251 |
+
intros P. induction P as [|p P IH]; intros Q HP HQ Hlen;
|
| 252 |
+
destruct Q as [|q Q]; simpl in *; try reflexivity; try discriminate.
|
| 253 |
+
inversion HP as [|x xs Hp HP' Heq0]; subst.
|
| 254 |
+
inversion HQ as [|x xs Hq HQ' Heq1]; subst.
|
| 255 |
+
assert (Hlen' : length P = length Q) by lia.
|
| 256 |
+
rewrite (lndiv p ((p + q) / 2)) by lra.
|
| 257 |
+
rewrite (lndiv q ((p + q) / 2)) by lra.
|
| 258 |
+
rewrite (IH Q HP' HQ' Hlen').
|
| 259 |
+
field.
|
| 260 |
+
Qed.
|
README.md
CHANGED
|
@@ -226,6 +226,34 @@ language model resists it) and `TOOL.md` (footprint, throughput, read-length, op
|
|
| 226 |
coordinate-ascent designs (stars) land far outside it, where the host head saturates but no biology
|
| 227 |
lives.*
|
| 228 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 229 |
## Screening and hybrid use
|
| 230 |
|
| 231 |
- **Certified screening frontier.** To pass an engineered construct as natural an adversary must
|
|
|
|
| 226 |
coordinate-ascent designs (stars) land far outside it, where the host head saturates but no biology
|
| 227 |
lives.*
|
| 228 |
|
| 229 |
+
## The composition boundary
|
| 230 |
+
|
| 231 |
+
The partition has a boundary you can compute before training and a proof that it holds. View a
|
| 232 |
+
length-L sequence as N = L - k + 1 draws from a class-conditional k-mer distribution, which is all a
|
| 233 |
+
composition classifier sees. The Bhattacharyya coefficient between the two classes' count
|
| 234 |
+
distributions factorizes exactly to rho^N, with rho the per-k-mer overlap; the Bayes error of any
|
| 235 |
+
count-based classifier is at most half of rho^N; and for a balanced label the mutual information
|
| 236 |
+
equals the Jensen-Shannon divergence between the count distributions. Solvability is therefore set by
|
| 237 |
+
D = -N ln rho, read off the class distributions with no classifier: equal distributions give rho = 1
|
| 238 |
+
and chance, rho < 1 gives an error that vanishes with length. These four statements are machine
|
| 239 |
+
checked in `CompositionBoundary.v` (Rocq, over Coq's standard reals, no `admit`).
|
| 240 |
+
|
| 241 |
+

|
| 242 |
+
|
| 243 |
+
Across the sixteen binary Nucleotide Transformer tasks, D predicts task by task whether composition
|
| 244 |
+
suffices. On the high-D tasks the 2 MB linear k-mer model is already at the ceiling, promoters at
|
| 245 |
+
0.96 and the H3 and H4 marks at 0.84 to 0.88, within one to two AUROC points of a CNN trained on the
|
| 246 |
+
same data, so no model has room to add much. The learned model's advantage sits on the low-D tasks
|
| 247 |
+
the criterion names in advance, splice donor and acceptor sites most clearly, where the discriminating
|
| 248 |
+
motif is positional and falls beneath the composition a bag of k-mers reads. The protein side matches:
|
| 249 |
+
amino-acid composition separates eukaryote from bacterium above the boundary, while the
|
| 250 |
+
substitution-effect signal behind variant pathogenicity sits below it; within-protein fitness assays
|
| 251 |
+
such as ProteinGym are mutant ensembles rather than composition samples, outside this model, and they
|
| 252 |
+
fall in the same deep regime. The consequence is a re-scoring rather than a new tool: a
|
| 253 |
+
sequence-classification benchmark can be ordered by D with no training, its above-boundary entries are
|
| 254 |
+
k-mer baselines a 2 MB model matches, and a foundation model's distinct contribution is whatever it
|
| 255 |
+
adds on the entries below.
|
| 256 |
+
|
| 257 |
## Screening and hybrid use
|
| 258 |
|
| 259 |
- **Certified screening frontier.** To pass an engineered construct as natural an adversary must
|
composition_boundary.png
ADDED
|
Git LFS Details
|