genomenet Claude Opus 4.7 (1M context) committed on
Commit 29d65cb · 1 Parent(s): 9907e62

Benchmark tab: add Cas-only Pfam-verified reference section

Adds a second benchmark block below the existing Cas+Acr mixed-reference
results, using the 175-protein Cas-only reference built from UniProt
Pfam-domain xrefs (scripts/analyses/crispr/build_cas_reference_pfam.py).

Pfam-based fetch eliminates the name-collision false positives that dogged
the earlier reference (acriflavin resistance proteins matching AcrIF*,
Type-III secretion HrpB matching broad PF09502, etc.). Each family is
pinned to a verified Pfam domain unique to that family; entries are
drawn from SwissProt and TrEMBL annotation_score:5.

Three panels: UMAP side-by-side, per-family silhouette, four-way
aspect comparison + results table.

Headline on Cas-only:
- Twin-MF wins leave-one-out top-1 (0.874 vs ESM2 0.834, +0.040) and
  MRR (0.911 vs 0.887, +0.024), the metrics closest to annotation use.
- Twin-BP leads 5-NN precision (0.802 vs ESM2 0.763) and AUC
  same-family (0.842 vs Twin-MF 0.821).
- ESM2 wins silhouette (0.221 vs Twin-BP 0.206) on this set because all
  175 are Cas, so its narrow cosine range gives clean relative geometry;
  silhouette is task-dependent in a way that retrieval metrics are not.
- Twin-CC is still worst on retrieval; do not use CC for Cas work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
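
The Pfam-based fetch described in the message above can be illustrated with a small sketch. This is not the contents of `build_cas_reference_pfam.py`, which is not shown in this commit; it only builds a UniProt REST search URL for one Pfam ID, assuming the public `rest.uniprot.org` search endpoint and its `xref:pfam-…`, `reviewed:` and `annotation_score:` query fields. The function name, the chosen return fields and the page size are hypothetical.

```python
from urllib.parse import urlencode

# Assumed public UniProtKB search endpoint (not taken from the commit).
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def pfam_query_url(pfam_id: str, size: int = 500) -> str:
    """Build a search URL for entries cross-referenced to one Pfam domain.

    Keeps SwissProt entries (reviewed:true) plus top-tier TrEMBL entries
    (annotation_score:5), mirroring the mix described in the commit message.
    """
    query = f"(xref:pfam-{pfam_id}) AND (reviewed:true OR annotation_score:5)"
    params = {
        "query": query,
        "format": "tsv",
        # Hypothetical choice of columns for a reference table.
        "fields": "accession,reviewed,protein_name,annotation_score",
        "size": size,
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


# PF01867 is the Cas1 domain ID quoted in the benchmark text.
cas1_url = pfam_query_url("PF01867")
print(cas1_url)
```

Querying by domain cross-reference rather than by protein name is what removes the name collisions: an acriflavin resistance protein has no Cas Pfam xref, so it never enters the result set.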

app.py CHANGED
@@ -1792,6 +1792,73 @@ with gr.Blocks(
             "pairwise cosines, which is confounded by compressed-range effects on this dataset."
         )
 
+        # -----------------------------------------------------------------
+        # 2. Cas-only Pfam-verified reference (175 proteins, Cas1-Cas12)
+        # -----------------------------------------------------------------
+        gr.Markdown(
+            "---\n"
+            "## Cas-only benchmark: Pfam-verified reference (175 proteins)\n\n"
+            "The benchmark above uses a mixed Cas + anti-CRISPR reference. To isolate "
+            "**within-Cas subfamily resolution** (the harder signal: all proteins are "
+            "functionally related, and the question is whether the model can tell Cas1 from "
+            "Cas9 from Cas12), we built a separate Cas-only reference from **UniProt Pfam-domain "
+            "xrefs** covering Cas1 through Cas12. Using Pfam IDs (e.g. `PF01867` for Cas1, "
+            "`PF18019` for Cas3-HD, `PF09703` for Cas8) instead of protein-name strings "
+            "eliminates name-collision false positives (e.g. acriflavin resistance proteins "
+            "matching `AcrIF*` queries, Type-III secretion HrpB matching broad `PF09502`). "
+            "The reference contains **175 proteins** across 12 Cas families, each verified "
+            "by Pfam domain membership; entries are a mix of SwissProt (high-curation) and TrEMBL "
+            "`annotation_score:5` (top TrEMBL tier).\n\n"
+            "Per-family counts: Cas1=20, Cas2=14, Cas3=15, Cas4=11, Cas5=15, Cas6=15, "
+            "Cas7=15, Cas8=15, Cas9=15, Cas10=15, Cas11=10, Cas12=15."
+        )
+
+        gr.Markdown("### Cas-only · UMAP (ESM2 left, Twin-MF right)")
+        gr.Image(
+            value="data/benchmark/cas_pfam/crispr_umap_esm2_vs_twin.png",
+            label=None, show_label=False, container=False,
+        )
+
+        gr.Markdown("### Cas-only · Silhouette per family")
+        gr.Image(
+            value="data/benchmark/cas_pfam/crispr_silhouette.png",
+            label=None, show_label=False, container=False,
+        )
+
+        gr.Markdown("### Cas-only · four-way model comparison")
+        gr.Image(
+            value="data/benchmark/cas_pfam/crispr_aspect_comparison.png",
+            label=None, show_label=False, container=False,
+        )
+        gr.Markdown(
+            "| model | within − between | 5-NN precision | LOO top-1 | LOO MRR | AUC family | silhouette |\n"
+            "|---|---|---|---|---|---|---|\n"
+            "| ESM2 baseline | 0.033 | 0.763 | 0.834 | 0.887 | 0.798 | **0.221** |\n"
+            "| Twin-BP | 0.171 | **0.802** | 0.851 | 0.896 | **0.842** | 0.206 |\n"
+            "| Twin-CC | **0.407** | 0.610 | 0.749 | 0.824 | 0.791 | 0.093 |\n"
+            "| **Twin-MF** | 0.274 | 0.759 | **0.874** | **0.911** | 0.821 | 0.192 |\n\n"
+            "### Interpretation: what changes relative to the Cas + Acr benchmark\n\n"
+            "- **Differences between methods shrink.** All 175 proteins are Cas, so the easy "
+            "wins from \"separate Cas from Acr\" are gone. What remains is subfamily "
+            "resolution (Cas1 vs Cas9 vs Cas12), which is a harder task.\n"
+            "- **Twin-MF wins the retrieval metrics that matter for annotation:** leave-one-out "
+            "top-1 family (0.874 vs ESM2 0.834 · +0.04) and LOO MRR (0.911 vs 0.887 · +0.024). "
+            "Holding out any Cas protein and asking \"what family is this?\", Twin-MF gets "
+            "it right more often than every other method.\n"
+            "- **Twin-BP leads AUC same-family and 5-NN precision.** On these coarser "
+            "neighbour-retrieval metrics, BP beats MF (0.842 vs 0.821 and 0.802 vs 0.759).\n"
+            "- **ESM2 wins silhouette (0.221 vs Twin-BP 0.206, Twin-MF 0.192).** Silhouette is a ratio, and "
+            "when every cluster is a Cas subfamily (no Acrs to separate), ESM2's narrow-but-"
+            "consistent cosine range gives clean geometry. This does not contradict the Cas+Acr "
+            "finding; it is a different task.\n"
+            "- **Twin-CC is the worst performer on every retrieval metric here**, matching the "
+            "Cas+Acr finding. Cellular-component GO labels don't inform Cas subfamily "
+            "discrimination.\n\n"
+            "**Reference build**: see "
+            "`scripts/analyses/crispr/build_cas_reference_pfam.py` "
+            "and `scripts/analyses/crispr/submit_embed_pfam.sh`."
+        )
+
     with gr.Tab("API"):
         gr.Markdown("""
         ### API
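
The leave-one-out metrics quoted in the interpretation ("hold out any Cas protein and ask what family this is") can be sketched as follows. The evaluation script itself is not part of this commit, so this is a minimal illustration under stated assumptions: cosine similarity is the ranking function, top-1 is the family of the single nearest neighbour, and the reciprocal rank is taken at the first same-family neighbour. The function names and the toy data are hypothetical, and the code uses only the standard library for portability; a real pipeline would likely vectorise this with NumPy.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def loo_family_retrieval(embeddings, families):
    """Return (top-1 accuracy, mean reciprocal rank) under leave-one-out."""
    n = len(embeddings)
    top1_hits = 0
    reciprocal_ranks = []
    for i in range(n):
        # Rank every other protein by similarity to the held-out one.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(embeddings[i], embeddings[j]),
            reverse=True,
        )
        ranked = [families[j] for j in neighbours]
        top1_hits += ranked[0] == families[i]
        # Reciprocal rank of the first same-family neighbour (0 if none).
        rr = 0.0
        for rank, fam in enumerate(ranked, start=1):
            if fam == families[i]:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return top1_hits / n, sum(reciprocal_ranks) / n


# Toy 2-D "embeddings": the third Cas1 point sits inside the Cas9 cluster.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.2, 0.8], [0.0, 1.0], [0.1, 0.9]]
toy_families = ["Cas1", "Cas1", "Cas1", "Cas9", "Cas9"]
top1, mrr = loo_family_retrieval(toy_embeddings, toy_families)
print(f"LOO top-1 = {top1:.3f}, LOO MRR = {mrr:.3f}")
```

In the toy data four of five proteins have a same-family nearest neighbour, and the misplaced one first meets its own family at rank 3, so MRR sits above top-1. That gap is the same pattern as in the table, where MRR exceeds top-1 for every model.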
data/benchmark/cas_pfam/crispr_aspect_comparison.csv ADDED
@@ -0,0 +1,5 @@
+model,dim,within_minus_between,AUC_same_family,AUC_same_group,5NN_family_precision,5NN_family_recall_adjusted,LOO_top1_family,LOO_MRR_family,n_eval_proteins
+ESM2 baseline,1280,0.03256771465142568,0.7979433043700397,,0.7634285714285716,0.7634285714285716,0.8342857142857143,0.8873040610183466,175
+Twin-BP,1024,0.17091611276070276,0.8419476967034808,,0.8022857142857142,0.8022857142857142,0.8514285714285714,0.895696169816989,175
+Twin-CC,1024,0.4065659927825133,0.7912301177082669,,0.6102857142857143,0.6102857142857143,0.7485714285714286,0.8236681599991083,175
+Twin-MF,1024,0.2736806819836299,0.820526912750563,,0.7588571428571429,0.7588571428571429,0.8742857142857143,0.9110980159863191,175
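
The results table in `app.py` is typed by hand from this CSV. A minimal sketch of generating the rows from the file instead, so the rounding is applied uniformly, is given below. The function name and the column selection are hypothetical; the silhouette column comes from a separate CSV and is omitted here. The two sample rows are copied from the file above.

```python
import csv
import io

# (column title, CSV field) pairs; titles mirror the table shown in the app.
COLUMNS = [
    ("within − between", "within_minus_between"),
    ("5-NN precision", "5NN_family_precision"),
    ("LOO top-1", "LOO_top1_family"),
    ("LOO MRR", "LOO_MRR_family"),
    ("AUC family", "AUC_same_family"),
]


def csv_to_markdown(csv_text: str) -> str:
    """Render the selected CSV columns as a Markdown table, 3 decimals each."""
    rows = csv.DictReader(io.StringIO(csv_text))
    lines = [
        "| model | " + " | ".join(title for title, _ in COLUMNS) + " |",
        "|" + "---|" * (len(COLUMNS) + 1),
    ]
    for row in rows:
        cells = [f"{float(row[field]):.3f}" for _, field in COLUMNS]
        lines.append(f"| {row['model']} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


SAMPLE = """\
model,dim,within_minus_between,AUC_same_family,AUC_same_group,5NN_family_precision,5NN_family_recall_adjusted,LOO_top1_family,LOO_MRR_family,n_eval_proteins
ESM2 baseline,1280,0.03256771465142568,0.7979433043700397,,0.7634285714285716,0.7634285714285716,0.8342857142857143,0.8873040610183466,175
Twin-MF,1024,0.2736806819836299,0.820526912750563,,0.7588571428571429,0.7588571428571429,0.8742857142857143,0.9110980159863191,175
"""

table = csv_to_markdown(SAMPLE)
print(table)
```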
data/benchmark/cas_pfam/crispr_aspect_comparison.png ADDED

Git LFS Details

  • SHA256: d3a7aae278d6858b91fe08d69c13617533ed454f6b03bc3a1d103a0dcb8129a7
  • Pointer size: 131 Bytes
  • Size of remote file: 180 kB
data/benchmark/cas_pfam/crispr_silhouette.csv ADDED
@@ -0,0 +1,49 @@
+model,family,n,mean_silhouette
+ESM2 baseline,Cas1,20,0.12765653431415558
+ESM2 baseline,Cas10,15,0.12854169309139252
+ESM2 baseline,Cas11,10,0.971772313117981
+ESM2 baseline,Cas12,15,0.189345583319664
+ESM2 baseline,Cas2,14,0.4075210988521576
+ESM2 baseline,Cas3,15,-0.13315853476524353
+ESM2 baseline,Cas4,11,-0.22660766541957855
+ESM2 baseline,Cas5,15,0.16992774605751038
+ESM2 baseline,Cas6,15,-0.1849641352891922
+ESM2 baseline,Cas7,15,0.029301460832357407
+ESM2 baseline,Cas8,15,0.8048871159553528
+ESM2 baseline,Cas9,15,0.5430005192756653
+Twin-BP,Cas1,20,0.12820157408714294
+Twin-BP,Cas10,15,0.35021838545799255
+Twin-BP,Cas11,10,0.9735381007194519
+Twin-BP,Cas12,15,0.1318134367465973
+Twin-BP,Cas2,14,0.5606688857078552
+Twin-BP,Cas3,15,-0.2925216853618622
+Twin-BP,Cas4,11,0.10622821003198624
+Twin-BP,Cas5,15,0.3723158538341522
+Twin-BP,Cas6,15,-0.3041587173938751
+Twin-BP,Cas7,15,-0.023406527936458588
+Twin-BP,Cas8,15,-0.056025028228759766
+Twin-BP,Cas9,15,0.8021896481513977
+Twin-CC,Cas1,20,-0.28027772903442383
+Twin-CC,Cas10,15,-0.0030324982944875956
+Twin-CC,Cas11,10,0.9918010830879211
+Twin-CC,Cas12,15,0.004344145301729441
+Twin-CC,Cas2,14,0.2784954607486725
+Twin-CC,Cas3,15,-0.19620533287525177
+Twin-CC,Cas4,11,-0.2506910264492035
+Twin-CC,Cas5,15,0.05506397783756256
+Twin-CC,Cas6,15,0.1116006001830101
+Twin-CC,Cas7,15,-0.0873374417424202
+Twin-CC,Cas8,15,0.3358798623085022
+Twin-CC,Cas9,15,0.5031325817108154
+Twin-MF,Cas1,20,0.039285123348236084
+Twin-MF,Cas10,15,0.5092483162879944
+Twin-MF,Cas11,10,0.9898713231086731
+Twin-MF,Cas12,15,-0.13932843506336212
+Twin-MF,Cas2,14,0.6767507195472717
+Twin-MF,Cas3,15,0.041006673127412796
+Twin-MF,Cas4,11,0.08107656985521317
+Twin-MF,Cas5,15,0.27881568670272827
+Twin-MF,Cas6,15,-0.21472005546092987
+Twin-MF,Cas7,15,-0.27227726578712463
+Twin-MF,Cas8,15,-0.20869775116443634
+Twin-MF,Cas9,15,0.8472511768341064
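
For readers unfamiliar with the metric, the per-family values above are means of the per-protein silhouette score s = (b − a) / max(a, b), where a is the mean distance to the protein's own family and b is the mean distance to the nearest other family. The sketch below shows that computation on toy data. It assumes cosine distance and at least two families; the distance metric actually used for this CSV is not stated in the commit, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine_distance(u, v):
    """1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def per_family_silhouette(embeddings, families):
    """Mean silhouette per family; assumes two or more distinct families."""
    members = defaultdict(list)
    for idx, fam in enumerate(families):
        members[fam].append(idx)

    scores = defaultdict(list)
    for i, fam in enumerate(families):
        own = [j for j in members[fam] if j != i]
        if not own:
            scores[fam].append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance to the rest of the protein's own family.
        a = sum(cosine_distance(embeddings[i], embeddings[j]) for j in own) / len(own)
        # b: mean distance to the closest other family.
        b = min(
            sum(cosine_distance(embeddings[i], embeddings[j]) for j in idxs) / len(idxs)
            for other, idxs in members.items()
            if other != fam
        )
        denom = max(a, b)
        scores[fam].append((b - a) / denom if denom > 0 else 0.0)
    return {fam: sum(vals) / len(vals) for fam, vals in scores.items()}


# Two tight, well-separated toy clusters.
toy = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
labels = ["Cas1", "Cas1", "Cas9", "Cas9"]
family_scores = per_family_silhouette(toy, labels)
print(family_scores)
```

A negative family mean, as for Cas3 or Cas6 under several models above, means the average member is closer to some other family than to its own.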
data/benchmark/cas_pfam/crispr_silhouette.png ADDED

Git LFS Details

  • SHA256: 81638f91c355d59139085a552bda7a20f6d8aa548656c42fdaf17e2fe867bf7a
  • Pointer size: 131 Bytes
  • Size of remote file: 262 kB
data/benchmark/cas_pfam/crispr_silhouette_global.csv ADDED
@@ -0,0 +1,5 @@
+model,mean_silhouette,n_proteins
+ESM2 baseline,0.22106678783893585,175
+Twin-BP,0.2058495283126831,175
+Twin-CC,0.09317418932914734,175
+Twin-MF,0.19240137934684753,175
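
The global value appears to be the mean over all 175 proteins, which equals the size-weighted mean of the per-family rows in `crispr_silhouette.csv`. A quick consistency check for the ESM2 baseline, using the (n, mean_silhouette) pairs copied from that file, is sketched below; the variable names are illustrative only.

```python
# (n, mean_silhouette) per family for "ESM2 baseline", from crispr_silhouette.csv.
esm2_per_family = {
    "Cas1": (20, 0.12765653431415558),
    "Cas10": (15, 0.12854169309139252),
    "Cas11": (10, 0.971772313117981),
    "Cas12": (15, 0.189345583319664),
    "Cas2": (14, 0.4075210988521576),
    "Cas3": (15, -0.13315853476524353),
    "Cas4": (11, -0.22660766541957855),
    "Cas5": (15, 0.16992774605751038),
    "Cas6": (15, -0.1849641352891922),
    "Cas7": (15, 0.029301460832357407),
    "Cas8": (15, 0.8048871159553528),
    "Cas9": (15, 0.5430005192756653),
}

total_proteins = sum(n for n, _ in esm2_per_family.values())
# Weight each family mean by its size to recover the per-protein mean.
weighted_mean = sum(n * s for n, s in esm2_per_family.values()) / total_proteins

reported_global = 0.22106678783893585  # ESM2 row of crispr_silhouette_global.csv
print(total_proteins, round(weighted_mean, 6), round(reported_global, 6))
```

Because the mean is size-weighted, a few very clean families (Cas11 and Cas8 for ESM2) can lift the global score even when several families have negative silhouettes.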
data/benchmark/cas_pfam/crispr_umap_esm2_vs_twin.png ADDED

Git LFS Details

  • SHA256: 5f46811afc5543d9481c1d053e26d671b48431e633c8945c59cf0e7890783ddd
  • Pointer size: 131 Bytes
  • Size of remote file: 279 kB