Benchmark tab: add Cas-only Pfam-verified reference section
Adds a second benchmark block below the existing Cas+Acr mixed-reference
results, using the 175-protein Cas-only reference built from UniProt
Pfam-domain xrefs (scripts/analyses/crispr/build_cas_reference_pfam.py).
Pfam-based fetch eliminates name-collision false positives that dogged
the earlier reference (acriflavin resistance proteins matching AcrIF*,
Type-III secretion HrpB matching broad PF09502, etc.). Each family is
pinned to a verified Pfam domain unique to that family; coverage
mixes SwissProt with TrEMBL entries at annotation_score:5.
Three panels: UMAP side-by-side, per-family silhouette, four-way
aspect comparison + results table.
Headline on Cas-only:
- Twin-MF wins leave-one-out top-1 (0.874 vs ESM2 0.834, +0.040) and
MRR (0.911 vs 0.887, +0.024), the retrieval metrics that matter most
for annotation.
- Twin-BP leads 5-NN precision and AUC same-family by a hair.
- ESM2 wins silhouette on this set because all 175 are Cas, so its
narrow cosine range gives clean relative geometry; silhouette is
task-dependent in a way that retrieval metrics are not.
- Twin-CC still worst on retrieval; do not use CC for Cas work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- app.py +67 -0
- data/benchmark/cas_pfam/crispr_aspect_comparison.csv +5 -0
- data/benchmark/cas_pfam/crispr_aspect_comparison.png +3 -0
- data/benchmark/cas_pfam/crispr_silhouette.csv +49 -0
- data/benchmark/cas_pfam/crispr_silhouette.png +3 -0
- data/benchmark/cas_pfam/crispr_silhouette_global.csv +5 -0
- data/benchmark/cas_pfam/crispr_umap_esm2_vs_twin.png +3 -0
@@ -1792,6 +1792,73 @@ with gr.Blocks(
             "pairwise cosines, which is confounded by compressed-range effects on this dataset."
         )

+        # -----------------------------------------------------------------
+        # 2. Cas-only Pfam-verified reference (175 proteins, Cas1-Cas12)
+        # -----------------------------------------------------------------
+        gr.Markdown(
+            "---\n"
+            "## Cas-only benchmark — Pfam-verified reference (175 proteins)\n\n"
+            "The benchmark above uses a mixed Cas + anti-CRISPR reference. To isolate "
+            "**within-Cas subfamily resolution** (the harder signal — all proteins are "
+            "functionally related, we're asking whether the model can tell Cas1 from Cas9 "
+            "from Cas13) we built a separate Cas-only reference from **UniProt Pfam-domain "
+            "xrefs** covering Cas1 through Cas12. Using Pfam IDs (e.g. `PF01867` for Cas1, "
+            "`PF18019` for Cas3-HD, `PF09703` for Cas8) instead of protein-name strings "
+            "eliminates name-collision false positives (e.g. acriflavin resistance proteins "
+            "matching `AcrIF*` queries, Type-III secretion HrpB matching broad `PF09502`). "
+            "The reference contains **175 proteins** across 12 Cas families, each verified "
+            "by Pfam domain membership; mix of SwissProt (high-curation) and TrEMBL "
+            "`annotation_score:5` (top TrEMBL tier).\n\n"
+            "Per-family counts: Cas1=20, Cas2=14, Cas3=15, Cas4=11, Cas5=15, Cas6=15, "
+            "Cas7=15, Cas8=15, Cas9=15, Cas10=15, Cas11=10, Cas12=15."
+        )
+
+        gr.Markdown("### Cas-only · UMAP (ESM2 left, Twin-MF right)")
+        gr.Image(
+            value="data/benchmark/cas_pfam/crispr_umap_esm2_vs_twin.png",
+            label=None, show_label=False, container=False,
+        )
+
+        gr.Markdown("### Cas-only · Silhouette per family")
+        gr.Image(
+            value="data/benchmark/cas_pfam/crispr_silhouette.png",
+            label=None, show_label=False, container=False,
+        )
+
+        gr.Markdown("### Cas-only · four-way model comparison")
+        gr.Image(
+            value="data/benchmark/cas_pfam/crispr_aspect_comparison.png",
+            label=None, show_label=False, container=False,
+        )
+        gr.Markdown(
+            "| model | within − between | 5-NN precision | LOO top-1 | LOO MRR | AUC family | silhouette |\n"
+            "|---|---|---|---|---|---|---|\n"
+            "| ESM2 baseline | 0.033 | 0.763 | 0.834 | 0.887 | 0.798 | **0.221** |\n"
+            "| Twin-BP | 0.171 | **0.802** | 0.851 | 0.896 | **0.842** | 0.206 |\n"
+            "| Twin-CC | **0.407** | 0.610 | 0.749 | 0.824 | 0.791 | 0.093 |\n"
+            "| **Twin-MF** | 0.274 | 0.759 | **0.874** | **0.911** | 0.820 | 0.192 |\n\n"
+            "### Interpretation — what changes relative to the Cas + Acr benchmark\n\n"
+            "- **Differences between methods shrink.** All 175 proteins are Cas, so the easy "
+            "wins from \"separate Cas from Acr\" are gone. What remains is subfamily "
+            "resolution (Cas1 vs Cas9 vs Cas13), which is a harder task.\n"
+            "- **Twin-MF wins the retrieval metrics that matter for annotation:** leave-one-out "
+            "top-1 family (0.874 vs ESM2 0.834 · +0.04) and LOO MRR (0.911 vs 0.887 · +0.024). "
+            "Holding out any Cas protein and asking \"what family is this?\" — Twin-MF gets "
+            "it right more often than every other method.\n"
+            "- **Twin-BP leads AUC same-family and 5-NN precision.** For coarser "
+            "neighbour-retrieval BP is comparable to or slightly better than MF.\n"
+            "- **ESM2 wins silhouette (+0.22 vs Twin-MF +0.19).** Silhouette is a ratio, and "
+            "when every cluster is a Cas subfamily (no Acrs to separate), ESM2's narrow-but-"
+            "consistent cosine range gives clean geometry. Not a contradiction with the Cas+Acr "
+            "finding — just a different task.\n"
+            "- **Twin-CC is the worst performer on every retrieval metric here**, matching the "
+            "Cas+Acr finding. Cellular-component GO labels don't inform Cas subfamily "
+            "discrimination.\n\n"
+            "**Reference build**: see "
+            "`scripts/analyses/crispr/build_cas_reference_pfam.py` "
+            "and `scripts/analyses/crispr/submit_embed_pfam.sh`."
+        )
+
     with gr.Tab("API"):
         gr.Markdown("""
 ### API
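The diff reports leave-one-out (LOO) top-1 and MRR but does not show how they are computed. The following is a minimal sketch of one common way to compute such metrics, assuming cosine similarity between embeddings and one family label per protein. The function names and the toy vectors are hypothetical and are not taken from the repository's scripts.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def loo_family_metrics(embeddings, families):
    """Return (top1, mrr) for leave-one-out family retrieval.

    Each protein is held out in turn and every other protein is ranked
    by descending cosine similarity to it.
    """
    n = len(embeddings)
    hits = 0
    reciprocal_rank_sum = 0.0
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(embeddings[i], embeddings[j]),
            reverse=True,
        )
        # Top-1: does the nearest neighbour share the query's family?
        if families[ranked[0]] == families[i]:
            hits += 1
        # Reciprocal rank of the first same-family neighbour (0 if none).
        for rank, j in enumerate(ranked, start=1):
            if families[j] == families[i]:
                reciprocal_rank_sum += 1.0 / rank
                break
    return hits / n, reciprocal_rank_sum / n


# Toy data (hypothetical): the last protein is labelled Cas9 but sits
# closer to the Cas1 cluster, so it is a deliberate top-1 miss.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.5]]
fams = ["Cas1", "Cas1", "Cas9", "Cas9", "Cas9"]
top1, mrr = loo_family_metrics(emb, fams)
print(f"LOO top-1 = {top1:.3f}, LOO MRR = {mrr:.3f}")
```

MRR gives partial credit when the correct family appears at rank 2 or 3, which is one reason the MRR column in the table can be higher than the top-1 column for every model.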
@@ -0,0 +1,5 @@
+model,dim,within_minus_between,AUC_same_family,AUC_same_group,5NN_family_precision,5NN_family_recall_adjusted,LOO_top1_family,LOO_MRR_family,n_eval_proteins
+ESM2 baseline,1280,0.03256771465142568,0.7979433043700397,,0.7634285714285716,0.7634285714285716,0.8342857142857143,0.8873040610183466,175
+Twin-BP,1024,0.17091611276070276,0.8419476967034808,,0.8022857142857142,0.8022857142857142,0.8514285714285714,0.895696169816989,175
+Twin-CC,1024,0.4065659927825133,0.7912301177082669,,0.6102857142857143,0.6102857142857143,0.7485714285714286,0.8236681599991083,175
+Twin-MF,1024,0.2736806819836299,0.820526912750563,,0.7588571428571429,0.7588571428571429,0.8742857142857143,0.9110980159863191,175
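The headline deltas in the commit message can be re-derived from this CSV. The sketch below copies the five rows verbatim and computes each Twin model's difference from the ESM2 baseline; the variable names (`CSV_TEXT`, `METRICS`, `deltas`, `best`) are illustrative and not part of the repository.

```python
import csv
import io

# Rows copied verbatim from the comparison CSV added in this commit.
CSV_TEXT = """model,dim,within_minus_between,AUC_same_family,AUC_same_group,5NN_family_precision,5NN_family_recall_adjusted,LOO_top1_family,LOO_MRR_family,n_eval_proteins
ESM2 baseline,1280,0.03256771465142568,0.7979433043700397,,0.7634285714285716,0.7634285714285716,0.8342857142857143,0.8873040610183466,175
Twin-BP,1024,0.17091611276070276,0.8419476967034808,,0.8022857142857142,0.8022857142857142,0.8514285714285714,0.895696169816989,175
Twin-CC,1024,0.4065659927825133,0.7912301177082669,,0.6102857142857143,0.6102857142857143,0.7485714285714286,0.8236681599991083,175
Twin-MF,1024,0.2736806819836299,0.820526912750563,,0.7588571428571429,0.7588571428571429,0.8742857142857143,0.9110980159863191,175
"""

# AUC_same_group is empty in every row, so it is left out here.
METRICS = ["AUC_same_family", "5NN_family_precision", "LOO_top1_family", "LOO_MRR_family"]

rows = {row["model"]: row for row in csv.DictReader(io.StringIO(CSV_TEXT))}
baseline = rows["ESM2 baseline"]

# Difference from the ESM2 baseline, per model and metric.
deltas = {
    model: {m: float(row[m]) - float(baseline[m]) for m in METRICS}
    for model, row in rows.items()
    if model != "ESM2 baseline"
}

# Which model has the highest value for each metric.
best = {m: max(rows, key=lambda name: float(rows[name][m])) for m in METRICS}

for model, d in deltas.items():
    print(model, {m: round(v, 3) for m, v in d.items()})
print(best)
```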
@@ -0,0 +1,49 @@
+model,family,n,mean_silhouette
+ESM2 baseline,Cas1,20,0.12765653431415558
+ESM2 baseline,Cas10,15,0.12854169309139252
+ESM2 baseline,Cas11,10,0.971772313117981
+ESM2 baseline,Cas12,15,0.189345583319664
+ESM2 baseline,Cas2,14,0.4075210988521576
+ESM2 baseline,Cas3,15,-0.13315853476524353
+ESM2 baseline,Cas4,11,-0.22660766541957855
+ESM2 baseline,Cas5,15,0.16992774605751038
+ESM2 baseline,Cas6,15,-0.1849641352891922
+ESM2 baseline,Cas7,15,0.029301460832357407
+ESM2 baseline,Cas8,15,0.8048871159553528
+ESM2 baseline,Cas9,15,0.5430005192756653
+Twin-BP,Cas1,20,0.12820157408714294
+Twin-BP,Cas10,15,0.35021838545799255
+Twin-BP,Cas11,10,0.9735381007194519
+Twin-BP,Cas12,15,0.1318134367465973
+Twin-BP,Cas2,14,0.5606688857078552
+Twin-BP,Cas3,15,-0.2925216853618622
+Twin-BP,Cas4,11,0.10622821003198624
+Twin-BP,Cas5,15,0.3723158538341522
+Twin-BP,Cas6,15,-0.3041587173938751
+Twin-BP,Cas7,15,-0.023406527936458588
+Twin-BP,Cas8,15,-0.056025028228759766
+Twin-BP,Cas9,15,0.8021896481513977
+Twin-CC,Cas1,20,-0.28027772903442383
+Twin-CC,Cas10,15,-0.0030324982944875956
+Twin-CC,Cas11,10,0.9918010830879211
+Twin-CC,Cas12,15,0.004344145301729441
+Twin-CC,Cas2,14,0.2784954607486725
+Twin-CC,Cas3,15,-0.19620533287525177
+Twin-CC,Cas4,11,-0.2506910264492035
+Twin-CC,Cas5,15,0.05506397783756256
+Twin-CC,Cas6,15,0.1116006001830101
+Twin-CC,Cas7,15,-0.0873374417424202
+Twin-CC,Cas8,15,0.3358798623085022
+Twin-CC,Cas9,15,0.5031325817108154
+Twin-MF,Cas1,20,0.039285123348236084
+Twin-MF,Cas10,15,0.5092483162879944
+Twin-MF,Cas11,10,0.9898713231086731
+Twin-MF,Cas12,15,-0.13932843506336212
+Twin-MF,Cas2,14,0.6767507195472717
+Twin-MF,Cas3,15,0.041006673127412796
+Twin-MF,Cas4,11,0.08107656985521317
+Twin-MF,Cas5,15,0.27881568670272827
+Twin-MF,Cas6,15,-0.21472005546092987
+Twin-MF,Cas7,15,-0.27227726578712463
+Twin-MF,Cas8,15,-0.20869775116443634
+Twin-MF,Cas9,15,0.8472511768341064
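For readers unfamiliar with the metric behind this table: the silhouette of a point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the lowest mean distance to the members of any other cluster. The sketch below computes per-family means under the assumption that cosine distance is used; the commit does not state the distance metric, and the function names and toy vectors are hypothetical.

```python
import math
from collections import defaultdict


def cosine_distance(u, v):
    """1 - cosine similarity (assumed metric; not confirmed by the commit)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def per_family_silhouette(embeddings, families):
    """Mean silhouette per family. Requires at least two families."""
    members = defaultdict(list)
    for idx, fam in enumerate(families):
        members[fam].append(idx)

    scores = defaultdict(list)
    for i, own in enumerate(families):
        same = [j for j in members[own] if j != i]
        if not same:
            # Common convention: a singleton cluster scores 0.
            scores[own].append(0.0)
            continue
        # a: mean distance to the rest of the point's own family.
        a = sum(cosine_distance(embeddings[i], embeddings[j]) for j in same) / len(same)
        # b: mean distance to the nearest other family.
        b = min(
            sum(cosine_distance(embeddings[i], embeddings[j]) for j in idxs) / len(idxs)
            for fam, idxs in members.items()
            if fam != own
        )
        denom = max(a, b)
        scores[own].append((b - a) / denom if denom > 0 else 0.0)

    return {fam: sum(vals) / len(vals) for fam, vals in scores.items()}


# Toy data (hypothetical): two tight, well-separated families.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
fams = ["Cas1", "Cas1", "Cas9", "Cas9"]
result = per_family_silhouette(emb, fams)
print(result)
```

Negative values in the table, such as Cas3 and Cas6 for several models, mean that members of those families are on average closer to some other family than to their own.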
@@ -0,0 +1,5 @@
+model,mean_silhouette,n_proteins
+ESM2 baseline,0.22106678783893585,175
+Twin-BP,0.2058495283126831,175
+Twin-CC,0.09317418932914734,175
+Twin-MF,0.19240137934684753,175
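The global silhouette values appear to be per-protein means, that is, the per-family means weighted by family size rather than a plain average over the 12 families. A quick check using the ESM2 baseline rows copied from the per-family table earlier in this diff (variable names are illustrative):

```python
# (n, mean_silhouette) per family for the ESM2 baseline, copied from the
# per-family silhouette CSV in this commit.
ESM2_PER_FAMILY = {
    "Cas1": (20, 0.12765653431415558),
    "Cas10": (15, 0.12854169309139252),
    "Cas11": (10, 0.971772313117981),
    "Cas12": (15, 0.189345583319664),
    "Cas2": (14, 0.4075210988521576),
    "Cas3": (15, -0.13315853476524353),
    "Cas4": (11, -0.22660766541957855),
    "Cas5": (15, 0.16992774605751038),
    "Cas6": (15, -0.1849641352891922),
    "Cas7": (15, 0.029301460832357407),
    "Cas8": (15, 0.8048871159553528),
    "Cas9": (15, 0.5430005192756653),
}
# Global ESM2 value from the global silhouette CSV.
GLOBAL_ESM2 = 0.22106678783893585

total_n = sum(n for n, _ in ESM2_PER_FAMILY.values())

# Size-weighted mean: equivalent to averaging over all 175 proteins.
weighted_mean = sum(n * s for n, s in ESM2_PER_FAMILY.values()) / total_n

# Plain average over families, ignoring family size.
unweighted_mean = sum(s for _, s in ESM2_PER_FAMILY.values()) / len(ESM2_PER_FAMILY)

print(f"n = {total_n}")
print(f"weighted   = {weighted_mean:.6f}")
print(f"unweighted = {unweighted_mean:.6f}")
print(f"reported   = {GLOBAL_ESM2:.6f}")
```

Because the weighting is by size, the small Cas11 family (n=10, silhouette near 0.97 for every model) contributes less to the global number than an unweighted family average would suggest.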