Spaces: genomenet/functional-distance
genomenet Claude Opus 4.7 (1M context) committed on Apr 24

Commit 54f342f (1 parent: 10564c4)

Switch Twin to aspect-specific checkpoints with runtime switcher

Replaced the old Twin checkpoint, whose training had collapsed, with the newer
aspect-specific BP/CC/MF family (train_point_{ASPECT}_20251221_std_ft_bs32ga4).
The new checkpoints show real dynamic range in their distance output (cosine
0.59-0.96, versus 0.999 for the old checkpoint).
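
To put the quoted cosine values in distance terms, here is a minimal sketch (not code from this repo). It relies on the identity that, for L2-normalized vectors, the squared L2 distance equals 2 - 2·cos; the app's description text later in this diff states that Twin distance is L2 on L2-normalized embeddings. The helper names and the toy vectors are made up for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def l2_distance_normalized(a, b):
    """L2 distance between the L2-normalized versions of a and b."""
    a, b = l2_normalize(a), l2_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_from_cosine(cos):
    """For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos))

# Cosine values quoted in the commit message.
for cos in (0.999, 0.96, 0.59):
    print(f"cos={cos:.3f} -> L2 distance on unit vectors = {distance_from_cosine(cos):.3f}")
```

Under this identity, a cosine of 0.999 corresponds to a distance of about 0.045, while 0.96 and 0.59 correspond to about 0.283 and 0.906, so the newer checkpoints spread pairs over a much wider distance range.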

- Three checkpoints hosted in one HF model repo: genomenet/twin-point-1024
- Added BP/CC/MF radio button to the Compare tab (default BP)
- Lazy-load: only one aspect cached at a time (memory constraint on cpu-basic)
- Each aspect checkpoint contains fine-tuned ESM2 backbone (~2.7 GB per file)
- Twin output dim back to 1024 (these use projection_dim=512, unlike the old one)
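
The lazy-load behaviour in the list above can be tried without downloading any checkpoint. The following is a simplified, self-contained sketch of a single-slot cache, not the `get_twin` implementation from this commit: `SingleSlotCache` and `fake_loader` are hypothetical names, and the loader stands in for the real download-and-load step.

```python
import gc

CHECKPOINT_FILES = {
    "BP": "bp_cp_best.pt",  # Biological Process
    "CC": "cc_cp_best.pt",  # Cellular Component
    "MF": "mf_cp_best.pt",  # Molecular Function
}

class SingleSlotCache:
    """Keep at most one loaded model in memory, keyed by GO aspect."""

    def __init__(self, loader):
        self._loader = loader  # callable: filename -> model object
        self._aspect = None
        self._model = None
        self.loads = 0         # counts real loads, for illustration only

    def get(self, aspect):
        aspect = aspect.upper()
        if aspect not in CHECKPOINT_FILES:
            raise ValueError(f"Unknown aspect {aspect!r}")
        if self._aspect == aspect and self._model is not None:
            return self._model  # cache hit: nothing is reloaded
        # Drop the old reference first so its memory can be reclaimed
        # before the next checkpoint is materialized.
        self._model = None
        self._aspect = None
        gc.collect()
        self._model = self._loader(CHECKPOINT_FILES[aspect])
        self._aspect = aspect
        self.loads += 1
        return self._model

def fake_loader(filename):
    # Hypothetical stand-in for the real download + model construction.
    return {"loaded_from": filename}

cache = SingleSlotCache(fake_loader)
cache.get("bp")     # first load
cache.get("BP")     # same aspect, served from the cache
cache.get("mf")     # evicts BP, loads MF
print(cache.loads)  # prints 2
```

The trade-off is that every aspect switch pays a reload, in exchange for peak memory staying near one checkpoint rather than three.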

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (3)

README.md +14 -7
app.py +73 -40
twin_model.py +2 -1

README.md

@@ -17,7 +17,7 @@ Compare protein sequence embeddings from ESM2 (baseline) and Twin network (fine-
 | Model | Parameters | Embedding | Training |
 |-------|------------|-----------|----------|
 | ESM2 | 650M | 1280-dim | Pretrained on UniRef50 |
-| Twin | ESM2 + custom | 512-dim | Contrastive on Resnik GO-semantic similarity (all aspects) |
 ## Usage
@@ -37,12 +37,19 @@ env vars.
 ## Twin model
-Two-tower contrastive encoder (AA-vocab Transformer + frozen ESM2 backbone),
-trained on Resnik GO-semantic similarity. The 2.5 GB checkpoint is hosted at
-[`genomenet/twin-aa-esm-resnik-1024`](https://huggingface.co/genomenet/twin-aa-esm-resnik-1024)
-and downloaded on Space startup. Override with `TWIN_REPO_ID` /
-`TWIN_CHECKPOINT_FILE` / `TWIN_ESM_BACKBONE` env vars. Architecture code
-lives in `twin_model.py` alongside `app.py`.
 ## Acknowledgements

 | Model | Parameters | Embedding | Training |
 |-------|------------|-----------|----------|
 | ESM2 | 650M | 1280-dim | Pretrained on UniRef50 |
+| Twin | ESM2 + custom | 1024-dim | Resnik-contrastive fine-tune, one checkpoint per GO aspect (BP/CC/MF) |
 ## Usage
 ## Twin model
+Two-tower contrastive encoder (custom AA Transformer + fine-tuned ESM2 backbone),
+trained on Resnik GO-semantic similarity. **Three aspect-specific checkpoints**
+(~2.7 GB each) are hosted at
+[`genomenet/twin-point-1024`](https://huggingface.co/genomenet/twin-point-1024):
+- `bp_cp_best.pt` — Biological Process
+- `cc_cp_best.pt` — Cellular Component
+- `mf_cp_best.pt` — Molecular Function
+The Space loads one aspect at a time (cpu-basic memory budget); switching aspects
+evicts the previous model and loads the next (~15 s from the disk cache once that
+aspect has been downloaded). Override via `TWIN_REPO_ID` / `TWIN_DEFAULT_ASPECT` / `TWIN_ESM_BACKBONE`
+env vars. Architecture code lives in `twin_model.py` alongside `app.py`.
 ## Acknowledgements

app.py

@@ -20,7 +20,7 @@ from plotly.subplots import make_subplots
 # Model config
 ESM2_MODEL = "esm2_t33_650M_UR50D"  # 650M params, 1280-dim
 ESM2_DIM = 1280
-TWIN_DIM = 512  # 2 * projection_dim (256), two-tower concat; 1024 in the run name is seq_len
 # FAISS index config (UniRef50 GO-annotated, ESM2 650M embeddings)
 FAISS_REPO_ID = os.environ.get("FAISS_REPO_ID", "genomenet/esm2-uniref50-faiss")
@@ -29,16 +29,21 @@ FAISS_IDS_FILE = "ids.npy"
 FAISS_METADATA_FILE = "metadata.json"
 FAISS_NPROBE = int(os.environ.get("FAISS_NPROBE", "32"))
-# Twin model config (checkpoint hosted as HF model repo)
-TWIN_REPO_ID = os.environ.get("TWIN_REPO_ID", "genomenet/twin-aa-esm-resnik-1024")
-TWIN_CHECKPOINT_FILE = os.environ.get("TWIN_CHECKPOINT_FILE", "model_aa_best.pt")
 TWIN_ESM_BACKBONE = os.environ.get("TWIN_ESM_BACKBONE", "facebook/esm2_t33_650M_UR50D")
 # Model cache
 _esm2_model = None
 _esm2_alphabet = None
-_twin_model = None
-_twin_seq_len = None
 _faiss_index = None
 _faiss_ids = None
 _faiss_metadata = None
@@ -56,29 +61,47 @@ def get_esm2():
         print("ESM2 loaded.")
     return _esm2_model, _esm2_alphabet
-def get_twin():
-    """Download + load the fine-tuned Twin model (two-tower contrastive encoder)."""
-    global _twin_model, _twin_seq_len
-    if _twin_model is not None:
-        return _twin_model, _twin_seq_len
     import torch
     from huggingface_hub import hf_hub_download
     from twin_model import load_twin_model
-    print(f"Downloading Twin checkpoint from {TWIN_REPO_ID}/{TWIN_CHECKPOINT_FILE}...")
-    ckpt_path = hf_hub_download(
-        repo_id=TWIN_REPO_ID,
-        filename=TWIN_CHECKPOINT_FILE,
-        repo_type="model",
-    )
     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
     model, seq_len, emb_dim = load_twin_model(ckpt_path, device, TWIN_ESM_BACKBONE)
     if emb_dim != TWIN_DIM:
-        print(f"  WARN Twin output dim is {emb_dim}, expected {TWIN_DIM}")
-    _twin_model = model
-    _twin_seq_len = seq_len
-    return _twin_model, _twin_seq_len
 def get_faiss():
     """Download + load FAISS index and UniRef50 id mapping from HF dataset repo."""
@@ -154,11 +177,11 @@ def embed_esm2(sequence):
     return embedding
 @torch.no_grad()
-def embed_twin(sequence):
-    """Compute Twin embedding: concat(custom_proj, esm_proj), (TWIN_DIM,) float32."""
     from twin_model import ensure_aa_sequence, preprocess_sequences_batch
-    model, seq_len = get_twin()
     device = next(model.parameters()).device
     cleaned = ensure_aa_sequence(sequence)
     input_ids = preprocess_sequences_batch([cleaned], seq_len, device)
@@ -297,7 +320,7 @@ def create_distribution_plot(esm2_emb, twin_emb):
 # Example protein (human insulin)
 EXAMPLE_PROTEIN = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
-def process(sequence: str, top_k: int = 10):
     """Process protein sequence, compare embeddings, and search FAISS."""
     sequence = strip_fasta_header(sequence.strip())
@@ -308,7 +331,7 @@ def process(sequence: str, top_k: int = 10):
     # Compute embeddings
     esm2_emb = embed_esm2(sequence)
-    twin_emb = embed_twin(sequence)
     # Compute stats
     esm2_stats = compute_stats(esm2_emb)
@@ -325,7 +348,7 @@ def process(sequence: str, top_k: int = 10):
     summary = f"""### Results
-| | ESM2 | Twin |
 |---|---|---|
 | Dimension | {esm2_stats['dim']} | {twin_stats['dim']} |
 | L2 Norm | {esm2_stats['l2_norm']:.2f} | {twin_stats['l2_norm']:.2f} |
@@ -337,7 +360,7 @@ Sequence: {len(sequence)} aa
     # Create visualizations
     esm2_heatmap = create_embedding_heatmap(esm2_emb, "ESM2 Embedding")
-    twin_heatmap = create_embedding_heatmap(twin_emb, "Twin Embedding")
     comparison_plot = create_comparison_plot(esm2_stats, twin_stats)
     distribution_plot = create_distribution_plot(esm2_emb, twin_emb)
@@ -364,6 +387,13 @@ with gr.Blocks(
                     minimum=1, maximum=50, value=10, step=1,
                     label="Nearest neighbors (top-k)"
                 )
                 btn = gr.Button("compare embeddings", variant="primary")
                 output = gr.Markdown()
                 with gr.Row():
@@ -376,7 +406,7 @@ with gr.Blocks(
         with gr.Row():
             esm2_heatmap = gr.Plot(label="ESM2 Embedding (1280-dim)")
-            twin_heatmap = gr.Plot(label="Twin Embedding (512-dim)")
         gr.Markdown("### Nearest UniRef50 neighbors (ESM2 embedding, cosine)")
         hits_table = gr.Dataframe(
@@ -388,7 +418,7 @@ with gr.Blocks(
     btn.click(
         process,
-        inputs=[seq_input, top_k_slider],
         outputs=[output, esm2_download, twin_download, esm2_heatmap, twin_heatmap, comparison_plot, distribution_plot, hits_table],
         api_name="compare"
     )
@@ -405,12 +435,13 @@ client = Client("genomenet/functional-distance")
 result = client.predict(
     sequence="MALWMRLLPLLALLALWG...",  # protein sequence
     top_k=10,
     api_name="/compare"
 )
 summary, esm2_path, twin_path, *plots, hits = result
 esm2_emb = np.load(esm2_path)   # (1280,)
-twin_emb = np.load(twin_path)   # (512,)
 # hits: DataFrame with columns [rank, uniref50_id, cosine, uniprot]
 ```
@@ -426,7 +457,7 @@ on UniProt.
 | Model | Dimension | Description |
 |-------|-----------|-------------|
 | ESM2 | 1280 | `esm2_t33_650M_UR50D` pretrained on UniRef50 |
-| Twin | 512 | Fine-tuned on Gene Ontology annotations |
 ### Comparison
@@ -444,11 +475,13 @@ this GO supervision.
 - Pretrained on UniRef50 with masked language modeling
 - General-purpose protein representation
-**Twin Network** (`aa_esm_resnik_1024_contrastive_padding`):
-- Two-tower contrastive encoder (AA-vocab Transformer + frozen ESM2 backbone)
-- Trained on Resnik GO-semantic similarity across all GO aspects (MF, BP, CC combined)
-- Output: `concat(custom_proj, esm_proj)` → 512-dim embedding optimized for
-  functional similarity (GO-coherent nearest neighbors)
 ### Gene Ontology
@@ -476,10 +509,10 @@ if __name__ == "__main__":
         _ = get_faiss()
     except Exception as e:
         print(f"FAISS load failed (will retry on first request): {e}")
-    print(f"Loading Twin model from {TWIN_REPO_ID}...")
     try:
-        _ = get_twin()
-        print("Twin ready!")
     except Exception as e:
         print(f"Twin load failed (will retry on first request): {e}")
     demo.launch(

 # Model config
 ESM2_MODEL = "esm2_t33_650M_UR50D"  # 650M params, 1280-dim
 ESM2_DIM = 1280
+TWIN_DIM = 1024  # 2 * projection_dim (512), two-tower concat
 # FAISS index config (UniRef50 GO-annotated, ESM2 650M embeddings)
 FAISS_REPO_ID = os.environ.get("FAISS_REPO_ID", "genomenet/esm2-uniref50-faiss")
 FAISS_METADATA_FILE = "metadata.json"
 FAISS_NPROBE = int(os.environ.get("FAISS_NPROBE", "32"))
+# Twin model config (3 aspect-specific checkpoints in one HF model repo)
+TWIN_REPO_ID = os.environ.get("TWIN_REPO_ID", "genomenet/twin-point-1024")
+TWIN_CHECKPOINT_FILES = {
+    "BP": "bp_cp_best.pt",  # Biological Process
+    "CC": "cc_cp_best.pt",  # Cellular Component
+    "MF": "mf_cp_best.pt",  # Molecular Function
+}
+TWIN_DEFAULT_ASPECT = os.environ.get("TWIN_DEFAULT_ASPECT", "BP")
 TWIN_ESM_BACKBONE = os.environ.get("TWIN_ESM_BACKBONE", "facebook/esm2_t33_650M_UR50D")
 # Model cache
 _esm2_model = None
 _esm2_alphabet = None
+# Only one aspect cached at a time (each Twin is ~2.7 GB on CPU, can't fit all 3 on a cpu-basic Space)
+_twin_cache = {"aspect": None, "model": None, "seq_len": None}
 _faiss_index = None
 _faiss_ids = None
 _faiss_metadata = None
         print("ESM2 loaded.")
     return _esm2_model, _esm2_alphabet
+def get_twin(aspect=None):
+    """Download + load the fine-tuned Twin model for the requested GO aspect.
+    Only one aspect is kept in memory at a time — switching aspects evicts the
+    previous model (each is ~2.7 GB; three won't fit on cpu-basic).
+    """
+    global _twin_cache
+    aspect = (aspect or TWIN_DEFAULT_ASPECT).upper()
+    if aspect not in TWIN_CHECKPOINT_FILES:
+        raise ValueError(f"Unknown aspect {aspect!r}; expected one of {list(TWIN_CHECKPOINT_FILES)}")
+    if _twin_cache["aspect"] == aspect and _twin_cache["model"] is not None:
+        return _twin_cache["model"], _twin_cache["seq_len"]
+    import gc
     import torch
     from huggingface_hub import hf_hub_download
     from twin_model import load_twin_model
+    # Evict any previously loaded aspect to free ~2.7 GB before loading the next.
+    if _twin_cache["model"] is not None:
+        print(f"Evicting Twin/{_twin_cache['aspect']} to load Twin/{aspect}...")
+        _twin_cache["model"] = None
+        _twin_cache["seq_len"] = None
+        _twin_cache["aspect"] = None
+        gc.collect()
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+    filename = TWIN_CHECKPOINT_FILES[aspect]
+    print(f"Downloading Twin/{aspect} checkpoint ({filename}) from {TWIN_REPO_ID}...")
+    ckpt_path = hf_hub_download(repo_id=TWIN_REPO_ID, filename=filename, repo_type="model")
     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
     model, seq_len, emb_dim = load_twin_model(ckpt_path, device, TWIN_ESM_BACKBONE)
     if emb_dim != TWIN_DIM:
+        print(f"  WARN Twin/{aspect} output dim is {emb_dim}, expected {TWIN_DIM}")
+    _twin_cache["aspect"] = aspect
+    _twin_cache["model"] = model
+    _twin_cache["seq_len"] = seq_len
+    return model, seq_len
 def get_faiss():
     """Download + load FAISS index and UniRef50 id mapping from HF dataset repo."""
     return embedding
 @torch.no_grad()
+def embed_twin(sequence, aspect=None):
+    """Compute Twin embedding for the given GO aspect (BP/CC/MF)."""
     from twin_model import ensure_aa_sequence, preprocess_sequences_batch
+    model, seq_len = get_twin(aspect)
     device = next(model.parameters()).device
     cleaned = ensure_aa_sequence(sequence)
     input_ids = preprocess_sequences_batch([cleaned], seq_len, device)
 # Example protein (human insulin)
 EXAMPLE_PROTEIN = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
+def process(sequence: str, top_k: int = 10, twin_aspect: str = "BP"):
     """Process protein sequence, compare embeddings, and search FAISS."""
     sequence = strip_fasta_header(sequence.strip())
     # Compute embeddings
     esm2_emb = embed_esm2(sequence)
+    twin_emb = embed_twin(sequence, aspect=twin_aspect)
     # Compute stats
     esm2_stats = compute_stats(esm2_emb)
     summary = f"""### Results
+| | ESM2 | Twin ({twin_aspect}) |
 |---|---|---|
 | Dimension | {esm2_stats['dim']} | {twin_stats['dim']} |
 | L2 Norm | {esm2_stats['l2_norm']:.2f} | {twin_stats['l2_norm']:.2f} |
     # Create visualizations
     esm2_heatmap = create_embedding_heatmap(esm2_emb, "ESM2 Embedding")
+    twin_heatmap = create_embedding_heatmap(twin_emb, f"Twin Embedding ({twin_aspect})")
     comparison_plot = create_comparison_plot(esm2_stats, twin_stats)
     distribution_plot = create_distribution_plot(esm2_emb, twin_emb)
                     minimum=1, maximum=50, value=10, step=1,
                     label="Nearest neighbors (top-k)"
                 )
+                twin_aspect_radio = gr.Radio(
+                    choices=["BP", "CC", "MF"],
+                    value=TWIN_DEFAULT_ASPECT,
+                    label="Twin GO aspect",
+                    info="Biological Process (BP), Cellular Component (CC), or Molecular Function (MF). "
+                         "First switch loads the aspect's checkpoint (~15 s)."
+                )
                 btn = gr.Button("compare embeddings", variant="primary")
                 output = gr.Markdown()
                 with gr.Row():
         with gr.Row():
             esm2_heatmap = gr.Plot(label="ESM2 Embedding (1280-dim)")
+            twin_heatmap = gr.Plot(label="Twin Embedding (1024-dim)")
         gr.Markdown("### Nearest UniRef50 neighbors (ESM2 embedding, cosine)")
         hits_table = gr.Dataframe(
     btn.click(
         process,
+        inputs=[seq_input, top_k_slider, twin_aspect_radio],
         outputs=[output, esm2_download, twin_download, esm2_heatmap, twin_heatmap, comparison_plot, distribution_plot, hits_table],
         api_name="compare"
     )
 result = client.predict(
     sequence="MALWMRLLPLLALLALWG...",  # protein sequence
     top_k=10,
+    twin_aspect="BP",  # "BP" | "CC" | "MF"
     api_name="/compare"
 )
 summary, esm2_path, twin_path, *plots, hits = result
 esm2_emb = np.load(esm2_path)   # (1280,)
+twin_emb = np.load(twin_path)   # (1024,)
 # hits: DataFrame with columns [rank, uniref50_id, cosine, uniprot]
 ```
 | Model | Dimension | Description |
 |-------|-----------|-------------|
 | ESM2 | 1280 | `esm2_t33_650M_UR50D` pretrained on UniRef50 |
+| Twin | 1024 | Resnik-contrastive fine-tune; one checkpoint per GO aspect (BP/CC/MF) |
 ### Comparison
 - Pretrained on UniRef50 with masked language modeling
 - General-purpose protein representation
+**Twin Network** (`train_point_{BP,CC,MF}_20251221_std_ft_bs32ga4`):
+- Two-tower contrastive encoder: custom AA Transformer + **fine-tuned** ESM2 backbone
+- **One checkpoint per GO aspect**: Biological Process (BP), Cellular Component (CC),
+  Molecular Function (MF). Pick aspect via the Twin GO aspect radio button.
+- Trained on Resnik GO-semantic similarity within each aspect
+- Output: `concat(custom_proj, esm_proj)` → 1024-dim; L2 distance on L2-normalized
+  embeddings ≈ functional distance in that aspect
 ### Gene Ontology
         _ = get_faiss()
     except Exception as e:
         print(f"FAISS load failed (will retry on first request): {e}")
+    print(f"Loading default Twin aspect ({TWIN_DEFAULT_ASPECT}) from {TWIN_REPO_ID}...")
     try:
+        _ = get_twin(TWIN_DEFAULT_ASPECT)
+        print(f"Twin/{TWIN_DEFAULT_ASPECT} ready!")
     except Exception as e:
         print(f"Twin load failed (will retry on first request): {e}")
     demo.launch(

twin_model.py

@@ -17,7 +17,8 @@ similarity (folder name: `aa_esm_resnik_1024_contrastive_padding_1gpu_old`):
 Final embedding = concat(custom_proj, esm_proj) with shape (2 * projection_dim,)
 Output size is `2 * projection_dim`, read from the checkpoint's `args` at load time.
-For the `model_aa_best.pt` checkpoint this is **512** (projection_dim=256).
 """
 from __future__ import annotations

 Final embedding = concat(custom_proj, esm_proj) with shape (2 * projection_dim,)
 Output size is `2 * projection_dim`, read from the checkpoint's `args` at load time.
+For the `train_point_{BP,CC,MF}_20251221_std_ft_bs32ga4/cp_best.pt` checkpoints
+(current default family) this is **1024** (projection_dim=512).
 """
 from __future__ import annotations
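
The docstring above says the output size is `2 * projection_dim`, read from the checkpoint's `args`. As a rough illustration of that arithmetic only: the sketch below assumes `args` is either a dict or an `argparse.Namespace` carrying a `projection_dim` field, which is a guess about the checkpoint layout rather than something this diff shows. `embedding_dim_from_args` is a hypothetical helper, not a function in `twin_model.py`.

```python
from argparse import Namespace

def embedding_dim_from_args(args):
    """Two-tower concat: custom_proj and esm_proj each contribute projection_dim."""
    # Assumption: `args` is a dict or a Namespace, depending on how it was saved.
    if isinstance(args, dict):
        projection_dim = args["projection_dim"]
    else:
        projection_dim = getattr(args, "projection_dim")
    return 2 * int(projection_dim)

# New aspect-specific family (projection_dim=512) and the old checkpoint (256).
print(embedding_dim_from_args({"projection_dim": 512}))        # 1024
print(embedding_dim_from_args(Namespace(projection_dim=256)))  # 512
```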