Update README with V2 expanded metrics, discovery pipeline results, and 19 A-level findings
README.md (CHANGED)
base_model: shibing624/text2vec-base-chinese
---

# structural-isomorphism-v2 (expanded)

A sentence-transformer model fine-tuned for **structural similarity across scientific domains** — recognizing that phenomena from completely different fields share the same underlying mathematical or dynamical structure.

This is the **V2 model**, trained on the expanded 5,689-sample dataset (original SIBD + 4,475 expanded phenomena across physics, biology, ecology, finance, and engineering). Compared to V1, V2 is significantly more **selective**: it finds fewer cross-domain matches, but with much higher precision, making it well suited to strict isomorphism discovery.

## Model Description

- **Base model**: [shibing624/text2vec-base-chinese](https://huggingface.co/shibing624/text2vec-base-chinese) (BERT-based, 768-dim)
- **Training data**: 5,689 descriptions (1,214 from the original SIBD + 4,475 from the expanded 4,443-phenomenon knowledge base)
- **Training objective**: MultipleNegativesRankingLoss (positive pairs = same structural type, different domain)
- **Hyperparameters**: 5 epochs | batch 16 | lr 2e-5 | warmup 10% | max 500 pairs per type
- **Training time**: ~3.5 hours on Apple M4 (MPS), 10,985 steps

## Evaluation Results

Evaluated on the expanded 4,443-phenomenon test set (1,000 sampled):

| Metric | V1 | **V2** | Delta |
|---|---|---|---|
| Silhouette Score | -0.17 | **0.55** | **+0.72** |
| Retrieval@5 | 23% | **96%** | **+73 pp** |

On the original 84-type SIBD test set, V2 also matches or exceeds V1 performance.
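Both metrics can be computed from embeddings and structural-type labels alone. A self-contained sketch on synthetic data, assuming scikit-learn is available (the `retrieval_at_5` helper and the toy clusters are illustrative, not the project's evaluation code):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def retrieval_at_5(emb, labels):
    """Share of queries whose 5 nearest cosine neighbours include
    at least one phenomenon of the same structural type."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)            # exclude self-retrieval
    top5 = np.argsort(-sims, axis=1)[:, :5]
    hits = [(labels[top5[q]] == labels[q]).any() for q in range(len(labels))]
    return float(np.mean(hits))

# Two synthetic, well-separated structural types (40 points, 8 dims)
rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 20)
emb = np.eye(8)[labels] * 10.0 + rng.normal(size=(40, 8))

print("silhouette :", round(silhouette_score(emb, labels), 2))
print("retrieval@5:", retrieval_at_5(emb, labels))
```

With real model output, `emb` would come from `model.encode(descs)` and `labels` from the structural-type annotations.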

## Discovery Pipeline Results

Running V2 on the expanded 4,443-phenomenon knowledge base with threshold 0.70:

| Step | Count |
|---|---|
| Cross-domain high-similarity pairs | **4,533** |
| LLM strict screening (50 batches) — 5/5 score | **94** |
| LLM strict screening — 4+/5 score (high potential) | **761** (16.8%) |
| Deep analysis of 94 top pairs → **A-level candidate papers** | **19** |

This represents a **75× stricter** retrieval than the V1 model on the same knowledge base (V1 returned 339,913 high-similarity pairs vs. V2's 4,533). V2 and V1 discover different structural isomorphisms and are complementary rather than redundant — their top-tier findings have zero overlap.

Top V2 A-level discoveries (deep-analysis score in parentheses):

1. Permafrost methane delayed feedback × Extinction debt (8.6)
2. Semiconductor laser relaxation oscillation × Algorithmic stablecoin anchoring (8.6)
3. Percolation threshold × Technology adoption chasm (8.5)
4. MHC over-dominant selection × Model ensemble (8.5)
5. Extinction debt × ENSO delayed oscillator (8.4)
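The zero-overlap comparison reduces to a set intersection over unordered name pairs. A sketch with placeholder findings (none of these pair names are the real V1/V2 outputs):

```python
def pair_key(name_a, name_b):
    """Unordered pair identity: (A, B) and (B, A) compare equal."""
    return frozenset((name_a, name_b))

# Placeholder top-tier findings; the real lists come from the two pipeline runs
v1_top = {pair_key("pendulum resonance", "business cycle"),
          pair_key("heat diffusion", "rumor spread")}
v2_top = {pair_key("permafrost methane feedback", "extinction debt"),
          pair_key("percolation threshold", "technology adoption chasm")}

overlap = v1_top & v2_top
print(f"shared top-tier findings: {len(overlap)}")  # prints 0 → complementary
```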

## Usage

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("qinghuiwan/structural-isomorphism-v2-expanded")

# Encode two descriptions from different domains (the model expects Chinese input)
# "Temperature-methane-temperature positive feedback from permafrost thaw releasing methane"
emb1 = model.encode("永冻土融化释放甲烷形成的温度-甲烷-温度正反馈循环")
# "Extinction debt caused by delayed generational feedback after habitat destruction"
emb2 = model.encode("生境破坏后物种世代反馈滞后引起的灭绝承诺债务")

similarity = util.cos_sim(emb1, emb2).item()
print(f"Structural similarity: {similarity:.3f}")
# Both share delayed-feedback dynamics → high structural similarity
```

### Discovery pipeline

```python
from itertools import combinations
import json

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("qinghuiwan/structural-isomorphism-v2-expanded")

# Load phenomenon knowledge base
kb = [json.loads(line) for line in open("kb-expanded.jsonl", encoding="utf-8")]
descs = [p["description"] for p in kb]
emb = model.encode(descs, convert_to_numpy=True, batch_size=64)

# Score all pairs at once rather than one cos_sim call per pair
sim = util.cos_sim(emb, emb).numpy()

# Report cross-domain high-similarity pairs
for i, j in combinations(range(len(kb)), 2):
    if kb[i]["domain"] == kb[j]["domain"]:
        continue  # same-domain matches are not candidate isomorphisms
    if sim[i, j] >= 0.70:
        print(f"{sim[i, j]:.3f} {kb[i]['name']} × {kb[j]['name']}")
```

## Links

- **Project homepage**: https://structural.bytedance.city
- **GitHub**: https://github.com/dada8899/structural-isomorphism
- **V1 model**: [qinghuiwan/structural-isomorphism-v1](https://huggingface.co/qinghuiwan/structural-isomorphism-v1)
- **Zenodo (v1.1)**: https://doi.org/10.5281/zenodo.19541416

## Citation

```bibtex
@software{structural_isomorphism_v2_2026,
  author = {Wan, Qinghui},
  title  = {Structural Isomorphism Search Engine — V2 Model (Expanded)},
  year   = {2026},
  url    = {https://github.com/dada8899/structural-isomorphism}
}
```