qinghuiwan committed on
Commit 3b5ff02 · verified · 1 Parent(s): 9c9c9b5

Update README with V2 expanded metrics, discovery pipeline results, and 19 A-level findings

Files changed (1): README.md (+66 -38)

README.md CHANGED
@@ -12,28 +12,50 @@ tags:
 base_model: shibing624/text2vec-base-chinese
 ---
 
- # structural-isomorphism/structural-v1
 
- A sentence-transformer model fine-tuned for **structural similarity** -- recognizing that phenomena from completely different domains share the same underlying structure.
 
- Unlike standard semantic similarity models that match by surface vocabulary, this model maps descriptions with the same structural pattern close together in embedding space, regardless of domain.
 
 ## Model Description
 
 - **Base model**: [shibing624/text2vec-base-chinese](https://huggingface.co/shibing624/text2vec-base-chinese) (BERT-based, 768-dim)
- - **Training data**: [SIBD](https://huggingface.co/datasets/structural-isomorphism/SIBD) -- 1,214 descriptions across 84 structural types
 - **Training objective**: MultipleNegativesRankingLoss (positive pairs = same structural type, different domain)
- - **Epochs**: 10 | **Batch size**: 16 | **Learning rate**: 2e-5 | **Warmup**: 10%
 
 ## Evaluation Results
 
- | Metric | Base Model | This Model | Improvement |
 |---|---|---|---|
- | Silhouette Score | -0.012 | **0.847** | +0.859 |
- | Retrieval@5 | 20.3% | **100.0%** | +79.7% |
- | Retrieval@10 | 18.0% | **100.0%** | +82.0% |
- | Intra-class Similarity | 0.643 | **0.933** | +0.290 |
- | Inter-class Similarity | 0.569 | **0.174** | -0.395 |
 
 ## Usage
 
@@ -42,49 +64,55 @@ Unlike standard semantic similarity models that match by surface vocabulary, thi
 ```python
 from sentence_transformers import SentenceTransformer, util
 
- model = SentenceTransformer("structural-isomorphism/structural-v1")
 
 # Encode two descriptions from different domains
- emb1 = model.encode("A thermostat detects low temperature and turns on heating")
- emb2 = model.encode("The pancreas detects high blood sugar and releases insulin")
 
 similarity = util.cos_sim(emb1, emb2).item()
 print(f"Structural similarity: {similarity:.3f}")
- # Both are negative feedback loops -> high similarity
 ```
 
- ### With the search engine
 
 ```python
- from structural_isomorphism import StructuralSearch
-
- search = StructuralSearch()
- results = search.query("Small input causes disproportionately large output")
- for r in results[:5]:
-     print(f"{r['name']} ({r['domain']}) - {r['score']:.3f}")
 ```
 
- ## Intended Use
-
- - Cross-domain structural similarity search
- - Finding analogies and inspiration across fields
- - Scientific discovery: identifying unknown structural connections
- - Educational tools for teaching structural thinking
-
- ## Limitations
 
- - Language: Currently trained on Chinese text only
- - Domain coverage: 84 structural types may not cover all possible patterns
- - The model recognizes structural types present in training data; novel structural types may not be well represented
 
 ## Citation
 
 ```bibtex
- @article{structural-isomorphism-2026,
-   title={Structural Isomorphism Search: Cross-Domain Structural Similarity Retrieval via Fine-tuned Embeddings},
-   author={Wan, Qihang},
-   journal={arXiv preprint arXiv:XXXX.XXXXX},
-   year={2026}
 }
 ```
 
 base_model: shibing624/text2vec-base-chinese
 ---
 
+ # structural-isomorphism-v2 (expanded)
 
+ A sentence-transformer model fine-tuned for **structural similarity across scientific domains** -- recognizing that phenomena from completely different fields share the same underlying mathematical or dynamical structure.
 
+ This is the **V2 model**, trained on the expanded 5,689-sample dataset (original SIBD + 4,475 expanded phenomena across physics, biology, ecology, finance, and engineering). Compared to V1, V2 is significantly more **selective**: it finds fewer cross-domain matches but with much higher precision, making it well suited to strict isomorphism discovery.
 
 ## Model Description
 
 - **Base model**: [shibing624/text2vec-base-chinese](https://huggingface.co/shibing624/text2vec-base-chinese) (BERT-based, 768-dim)
+ - **Training data**: 5,689 descriptions (1,214 original SIBD + 4,475 from the expanded 4,443-phenomenon knowledge base)
 - **Training objective**: MultipleNegativesRankingLoss (positive pairs = same structural type, different domain)
+ - **Hyperparameters**: 5 epochs | batch size 16 | learning rate 2e-5 | warmup 10% | max 500 positive pairs per type
+ - **Training time**: ~3.5 hours on Apple M4 (MPS), 10,985 steps
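The MultipleNegativesRankingLoss objective listed above can be illustrated with a toy in-batch computation: each anchor's own positive is the target, and every other positive in the batch serves as an in-batch negative. This is an illustrative sketch, not the project's training code; the toy vectors and the `scale` value (sentence-transformers uses 20 by default) are assumptions.

```python
import math

def mnrl(anchors, positives, scale=20.0):
    """Toy MultipleNegativesRankingLoss: mean cross-entropy over scaled
    cosine-similarity logits, with the matching positive as the target."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cos(a, p) for p in positives]
        log_z = math.log(sum(math.exp(v) for v in logits))
        loss += log_z - logits[i]  # -log softmax probability of the true pair
    return loss / len(anchors)

# Two structural types, each phrased in a different "domain"
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
print(f"{mnrl(anchors, positives):.4f}")  # 0.0000 -- each anchor ranks its own positive first
```

When the pairing is scrambled the loss grows large, which is the gradient signal that pulls same-type, different-domain descriptions together during fine-tuning.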
 
 ## Evaluation Results
 
+ Evaluated on the expanded 4,443-phenomenon test set (1,000 sampled):
+
+ | Metric | V1 | **V2** | Delta |
 |---|---|---|---|
+ | Silhouette Score | -0.17 | **0.55** | **+0.72** |
+ | Retrieval@5 | 23% | **96%** | **+73%** |
+
+ On the original 84-type SIBD test set, V2 also matches or exceeds V1 performance.
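Retrieval@k here presumably means leave-one-out nearest-neighbour retrieval: a query counts as a hit when at least one of its top-k neighbours shares its structural type. A minimal sketch with toy vectors follows; the exact metric definition is an assumption, and this is not the project's evaluation script.

```python
import math

def retrieval_at_k(embeddings, labels, k):
    """Fraction of items whose k nearest neighbours (by cosine similarity,
    excluding the item itself) include at least one same-type item."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    hits = 0
    for i, (e, lab) in enumerate(zip(embeddings, labels)):
        sims = sorted(
            ((cos(e, other), labels[j]) for j, other in enumerate(embeddings) if j != i),
            reverse=True,
        )
        if any(other_lab == lab for _, other_lab in sims[:k]):
            hits += 1
    return hits / len(embeddings)

# Toy set: two structural types pointing in clearly separated directions
embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = ["feedback", "feedback", "threshold", "threshold"]
print(retrieval_at_k(embs, labels, k=1))  # 1.0 -- every nearest neighbour is same-type
```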
 
+ ## Discovery Pipeline Results
 
+ Running V2 on the expanded 4,443-phenomenon knowledge base with similarity threshold 0.70:
+
+ | Step | Count |
+ |---|---|
+ | Cross-domain high-similarity pairs | **4,533** |
+ | LLM strict screening (50 batches) -- 5/5 score | **94** |
+ | LLM strict screening -- 4+/5 score (high potential) | **761** (16.8%) |
+ | Deep analysis of 94 top pairs → **A-level candidate papers** | **19** |
+
+ This represents **75× stricter** retrieval than the V1 model on the same knowledge base (V1 returned 339,913 high-similarity pairs). V2 and V1 discover different structural isomorphisms and are complementary rather than redundant: their top-tier findings have zero overlap.
+
+ Top V2 A-level discoveries (deep-analysis score):
+ 1. Permafrost methane delayed feedback × Extinction debt (8.6)
+ 2. Semiconductor laser relaxation oscillation × Algorithmic stablecoin anchoring (8.6)
+ 3. Percolation threshold × Technology adoption chasm (8.5)
+ 4. MHC over-dominant selection × Model ensemble (8.5)
+ 5. Extinction debt × ENSO delayed oscillator (8.4)
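The zero-overlap claim above is a straightforward set check once both models' top pairs are on disk. A sketch with hypothetical pair lists; the real lists come from each model's pipeline output.

```python
def pair_overlap(pairs_a, pairs_b):
    """Treat (x, y) and (y, x) as the same pair; return the shared set."""
    def norm(pairs):
        return {tuple(sorted(p)) for p in pairs}
    return norm(pairs_a) & norm(pairs_b)

# Hypothetical top-tier findings from each model
v1_top = [("thermostat", "insulin regulation"), ("resonance", "bank run")]
v2_top = [("permafrost methane feedback", "extinction debt"),
          ("percolation threshold", "technology adoption chasm")]
print(len(pair_overlap(v1_top, v2_top)))  # 0 -> complementary top-tier findings
```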
 
 ## Usage
 
 ```python
 from sentence_transformers import SentenceTransformer, util
 
+ model = SentenceTransformer("qinghuiwan/structural-isomorphism-v2-expanded")
 
 # Encode two descriptions from different domains
+ emb1 = model.encode("永冻土融化释放甲烷形成的温度-甲烷-温度正反馈循环")  # permafrost thaw releasing methane: a temperature-methane-temperature positive feedback loop
+ emb2 = model.encode("生境破坏后物种世代反馈滞后引起的灭绝承诺债务")  # extinction debt from delayed generational feedback after habitat destruction
 
 similarity = util.cos_sim(emb1, emb2).item()
 print(f"Structural similarity: {similarity:.3f}")
+ # Both share delayed-feedback dynamics -> high structural similarity
 ```
 
+ ### Discovery pipeline
 
 ```python
+ from sentence_transformers import SentenceTransformer, util
+ from itertools import combinations
+ import json
+
+ model = SentenceTransformer("qinghuiwan/structural-isomorphism-v2-expanded")
+
+ # Load the phenomenon knowledge base (one JSON object per line)
+ with open("kb-expanded.jsonl") as f:
+     kb = [json.loads(line) for line in f]
+ descs = [p["description"] for p in kb]
+ emb = model.encode(descs, convert_to_numpy=True, batch_size=64)
+
+ # Find cross-domain high-similarity pairs
+ for i, j in combinations(range(len(kb)), 2):
+     if kb[i]["domain"] == kb[j]["domain"]:
+         continue
+     sim = float(util.cos_sim(emb[i], emb[j]))
+     if sim >= 0.70:
+         print(f"{sim:.3f}  {kb[i]['name']} × {kb[j]['name']}")
 ```
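For 4,443 phenomena the pairwise loop above makes roughly 10 million `cos_sim` calls; the same threshold filter can be computed in a single matrix product over unit-normalized embeddings. A sketch assuming NumPy and in-memory embeddings; the function name and toy data are illustrative, not part of the project's API.

```python
import numpy as np

def cross_domain_pairs(emb, domains, threshold=0.70):
    """Return (i, j, sim) for all cross-domain pairs with cosine sim >= threshold."""
    emb = np.asarray(emb, dtype=np.float64)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
    sims = emb @ emb.T                                      # full cosine-similarity matrix
    iu, ju = np.triu_indices(len(emb), k=1)                 # upper triangle: each pair once
    dom = np.asarray(domains)
    keep = (dom[iu] != dom[ju]) & (sims[iu, ju] >= threshold)
    return list(zip(iu[keep], ju[keep], sims[iu, ju][keep]))

# Toy check: only the near-parallel cross-domain pair survives the threshold
emb = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
pairs = cross_domain_pairs(emb, ["ecology", "finance", "finance"])
print(pairs)  # the single qualifying pair is (0, 1)
```

The matrix form trades memory (an n × n float matrix) for speed; for this knowledge-base size that is about 160 MB, well within a laptop's reach.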
 
+ ## Links
 
+ - **Project homepage**: https://structural.bytedance.city
+ - **GitHub**: https://github.com/dada8899/structural-isomorphism
+ - **V1 model**: [qinghuiwan/structural-isomorphism-v1](https://huggingface.co/qinghuiwan/structural-isomorphism-v1)
+ - **Zenodo (v1.1)**: https://doi.org/10.5281/zenodo.19541416
 
 ## Citation
 
 ```bibtex
+ @software{structural_isomorphism_v2_2026,
+   author = {Wan, Qinghui},
+   title = {Structural Isomorphism Search Engine — V2 Model (Expanded)},
+   year = {2026},
+   url = {https://github.com/dada8899/structural-isomorphism}
 }
 ```