Update README with V2 expanded metrics, discovery pipeline results, and 19 A-level findings
README.md (CHANGED)
base_model: shibing624/text2vec-base-chinese
---

# structural-isomorphism-v2 (expanded)

A sentence-transformer model fine-tuned for **structural similarity across scientific domains** — recognizing that phenomena from completely different fields share the same underlying mathematical or dynamical structure.

This is the **V2 model**, trained on the expanded 5,689-sample dataset (original SIBD + 4,475 expanded phenomena across physics, biology, ecology, finance, and engineering). Compared to V1, V2 is significantly more **selective**: it finds fewer cross-domain matches, but with much higher precision, making it well suited to strict isomorphism discovery.

## Model Description

- **Base model**: [shibing624/text2vec-base-chinese](https://huggingface.co/shibing624/text2vec-base-chinese) (BERT-based, 768-dim)
- **Training data**: 5,689 descriptions (1,214 from the original SIBD + 4,475 from the expanded 4,443-phenomenon knowledge base)
- **Training objective**: MultipleNegativesRankingLoss (positive pairs = same structural type, different domain)
- **Hyperparameters**: 5 epochs | batch 16 | lr 2e-5 | warmup 10% | max 500 pairs per type
- **Training time**: ~3.5 hours on Apple M4 (MPS), 10,985 steps

## Evaluation Results

Evaluated on the expanded 4,443-phenomenon test set (1,000 sampled):

| Metric | V1 | **V2** | Delta |
|---|---|---|---|
| Silhouette Score | -0.17 | **0.55** | **+0.72** |
| Retrieval@5 | 23% | **96%** | **+73 pp** |

On the original 84-type SIBD test set, V2 also matches or exceeds V1 performance.
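Both metrics can be computed from embeddings and structural-type labels alone. A self-contained sketch on synthetic data, assuming scikit-learn is available (the `retrieval_at_5` helper and the toy clusters are illustrative, not the project's evaluation code):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def retrieval_at_5(emb, labels):
    """Share of queries whose 5 nearest cosine neighbours include
    at least one phenomenon of the same structural type."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)            # exclude self-retrieval
    top5 = np.argsort(-sims, axis=1)[:, :5]
    hits = [(labels[top5[q]] == labels[q]).any() for q in range(len(labels))]
    return float(np.mean(hits))

# Two synthetic, well-separated structural types (40 points, 8 dims)
rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 20)
emb = np.eye(8)[labels] * 10.0 + rng.normal(size=(40, 8))

print("silhouette :", round(silhouette_score(emb, labels), 2))
print("retrieval@5:", retrieval_at_5(emb, labels))
```

With real model output, `emb` would come from `model.encode(descs)` and `labels` from the structural-type annotations.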

## Discovery Pipeline Results

Running V2 on the expanded 4,443-phenomenon knowledge base with threshold 0.70:

| Step | Count |
|---|---|
| Cross-domain high-similarity pairs | **4,533** |
| LLM strict screening (50 batches) — 5/5 score | **94** |
| LLM strict screening — 4+/5 score (high potential) | **761** (16.8%) |
| Deep analysis of 94 top pairs → **A-level candidate papers** | **19** |

This represents a **75× stricter** retrieval than the V1 model on the same knowledge base (V1 returned 339,913 high-similarity pairs vs. V2's 4,533). V2 and V1 discover different structural isomorphisms and are complementary rather than redundant — their top-tier findings have zero overlap.

Top V2 A-level discoveries (deep-analysis score in parentheses):

1. Permafrost methane delayed feedback × Extinction debt (8.6)
2. Semiconductor laser relaxation oscillation × Algorithmic stablecoin anchoring (8.6)
3. Percolation threshold × Technology adoption chasm (8.5)
4. MHC over-dominant selection × Model ensemble (8.5)
5. Extinction debt × ENSO delayed oscillator (8.4)
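The zero-overlap comparison reduces to a set intersection over unordered name pairs. A sketch with placeholder findings (none of these pair names are the real V1/V2 outputs):

```python
def pair_key(name_a, name_b):
    """Unordered pair identity: (A, B) and (B, A) compare equal."""
    return frozenset((name_a, name_b))

# Placeholder top-tier findings; the real lists come from the two pipeline runs
v1_top = {pair_key("pendulum resonance", "business cycle"),
          pair_key("heat diffusion", "rumor spread")}
v2_top = {pair_key("permafrost methane feedback", "extinction debt"),
          pair_key("percolation threshold", "technology adoption chasm")}

overlap = v1_top & v2_top
print(f"shared top-tier findings: {len(overlap)}")  # prints 0 → complementary
```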

## Usage

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("qinghuiwan/structural-isomorphism-v2-expanded")

# Encode two descriptions from different domains (the model expects Chinese input)
# "Temperature-methane-temperature positive feedback from permafrost thaw releasing methane"
emb1 = model.encode("永冻土融化释放甲烷形成的温度-甲烷-温度正反馈循环")
# "Extinction debt caused by delayed generational feedback after habitat destruction"
emb2 = model.encode("生境破坏后物种世代反馈滞后引起的灭绝承诺债务")

similarity = util.cos_sim(emb1, emb2).item()
print(f"Structural similarity: {similarity:.3f}")
# Both share delayed-feedback dynamics → high structural similarity
```

### Discovery pipeline

```python
from itertools import combinations
import json

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("qinghuiwan/structural-isomorphism-v2-expanded")

# Load phenomenon knowledge base
kb = [json.loads(line) for line in open("kb-expanded.jsonl", encoding="utf-8")]
descs = [p["description"] for p in kb]
emb = model.encode(descs, convert_to_numpy=True, batch_size=64)

# Score all pairs at once rather than one cos_sim call per pair
sim = util.cos_sim(emb, emb).numpy()

# Report cross-domain high-similarity pairs
for i, j in combinations(range(len(kb)), 2):
    if kb[i]["domain"] == kb[j]["domain"]:
        continue  # same-domain matches are not candidate isomorphisms
    if sim[i, j] >= 0.70:
        print(f"{sim[i, j]:.3f} {kb[i]['name']} × {kb[j]['name']}")
```

## Links

- **Project homepage**: https://structural.bytedance.city
- **GitHub**: https://github.com/dada8899/structural-isomorphism
- **V1 model**: [qinghuiwan/structural-isomorphism-v1](https://huggingface.co/qinghuiwan/structural-isomorphism-v1)
- **Zenodo (v1.1)**: https://doi.org/10.5281/zenodo.19541416

## Citation

```bibtex
@software{structural_isomorphism_v2_2026,
  author = {Wan, Qinghui},
  title  = {Structural Isomorphism Search Engine — V2 Model (Expanded)},
  year   = {2026},
  url    = {https://github.com/dada8899/structural-isomorphism}
}
```