# BioHiCL-Large: Hierarchical Multi-Label Contrastive Biomedical Retriever

## Model Card

## 🔍 Overview
BioHiCL-large is a biomedical dense retriever trained with hierarchical MeSH supervision to capture fine-grained semantic relationships between biomedical texts.

Unlike traditional dense retrievers trained with binary relevance signals, BioHiCL models semantic similarity using structured multi-label supervision derived from the MeSH ontology, enabling it to capture partial semantic overlap between documents.

> ⚠️ Important: Please ensure that the `transformers` version matches exactly (4.57.3), as other versions may lead to compatibility issues or unexpected behavior.

---

## 💡 Key Features
- **Hierarchical supervision**: Leverages MeSH ontology to encode structured biomedical semantics  
- **Multi-label similarity learning**: Captures graded semantic overlap beyond binary relevance  
- **Contrastive + regression training**: Aligns embedding similarity with label similarity  
- **Efficient**: ~0.3B parameters, suitable for deployment on a single GPU  
- **Domain-adapted retriever**: Fine-tuned from a strong general-purpose bi-encoder  

---

## 🧠 Model Details
- **Model type**: Bi-encoder (dense retriever)  
- **Backbone**: BAAI/bge-large-en-v1.5  
- **Parameters**: ~0.3B  
- **Fine-tuning**: LoRA (merged into base model)  
- **Max input length**: 512 tokens  
- **Training data**: Biomedical abstracts annotated with MeSH labels (e.g., BioASQ-derived corpora)  

---

## ⚙️ Intended Use
This model is intended for biomedical information retrieval tasks such as:
- Scientific literature search (e.g., PubMed-style retrieval)
- Biomedical document ranking
- Query–abstract semantic matching
- Benchmark evaluation on BEIR biomedical subsets
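
For query–abstract matching outside of BEIR, a minimal sketch follows. It assumes the checkpoint loads with `sentence_transformers.SentenceTransformer` (the library that BEIR's `SentenceBERT` wrapper is built on); the helper names `rank_by_cosine` and `search` are illustrative and not part of the released code.

```python
import numpy as np


def rank_by_cosine(query_emb, doc_embs):
    """Return (document indices sorted best-first, cosine score per document)."""
    q = np.asarray(query_emb, dtype=float)
    d = np.asarray(doc_embs, dtype=float)
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)
    return order, scores


def search(query, abstracts, model_name="LunaLan07/BioHiCL-large"):
    """Embed a query and a list of abstracts, then rank the abstracts."""
    # Assumption: the checkpoint is loadable through sentence-transformers.
    # Imported lazily so rank_by_cosine works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    q_emb = model.encode(query)
    d_embs = model.encode(abstracts)
    order, scores = rank_by_cosine(q_emb, d_embs)
    return [(abstracts[i], float(scores[i])) for i in order]
```

Calling `search(query, abstracts)` downloads the checkpoint on first use and returns the abstracts paired with their cosine scores, best match first.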

---

## ⚙️ How It Works
BioHiCL trains the encoder so that two similarity measures agree for each pair of documents:

- Embedding similarity (SimE): cosine similarity between document embeddings  
- Label similarity (SimL): cosine similarity over weighted MeSH multi-label vectors  
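
The two quantities can be made concrete with a small NumPy sketch. The toy embeddings, the MeSH weight vectors, and the mean-squared-error alignment term are assumptions chosen for illustration; the actual label weighting and training objective are defined in the BioHiCL paper and are not reproduced here.

```python
import numpy as np


def cosine_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    x = np.asarray(x, dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


# Hypothetical 2-d embeddings for three documents (real ones are much larger).
embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])

# Hypothetical weighted multi-label vectors over three MeSH terms.
# Docs 0 and 1 share one term, docs 1 and 2 share another, docs 0 and 2 share none.
labels = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])

sim_e = cosine_matrix(embeddings)  # SimE: embedding similarity
sim_l = cosine_matrix(labels)      # SimL: label similarity

# Assumed alignment term: mean squared gap over distinct document pairs.
pairs = np.triu_indices(len(embeddings), k=1)
alignment_loss = np.mean((sim_e[pairs] - sim_l[pairs]) ** 2)
print(f"SimE(0,1)={sim_e[0, 1]:.2f}  SimL(0,1)={sim_l[0, 1]:.2f}  loss={alignment_loss:.4f}")
```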
---


## ⚙️ Requirements
- python >= 3.8  
- transformers == 4.57.3
> ⚠️ Important: Please ensure that the `transformers` version matches exactly (4.57.3), as other versions may lead to compatibility issues or unexpected behavior.
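
Because the pin is exact, it can help to verify the installed version before loading the model. The sketch below uses only the standard library; the helper names `installed_version` and `check_pin` are illustrative.

```python
from importlib import metadata

REQUIRED_TRANSFORMERS = "4.57.3"  # version pinned in this model card


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(installed, required=REQUIRED_TRANSFORMERS):
    """Return a human-readable status for an exact version pin."""
    if installed is None:
        return f"not installed; run: pip install transformers=={required}"
    if installed != required:
        return f"found {installed}, expected {required}"
    return "ok"


print("transformers:", check_pin(installed_version("transformers")))
```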
---

## 🚀 Usage (BEIR Evaluation)

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.models import SentenceBERT
from beir.retrieval.search.dense import DenseRetrievalExactSearch
from beir.retrieval.evaluation import EvaluateRetrieval


# 1. Download and load the SciFact dataset
dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/" + dataset + ".zip"

data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# 2. Load the model (requires transformers == 4.57.3, see Requirements)
model_name = "LunaLan07/BioHiCL-large"
model = SentenceBERT(model_name)

# 3. Retrieve the top-k documents per query with exact cosine-similarity search
retriever = DenseRetrievalExactSearch(model, batch_size=16)

top_k = 10  # top 10 documents per query
results = retriever.search(corpus, queries, top_k=top_k, score_function="cos_sim")

# 4. Score the rankings against the relevance judgments and print the metrics
k_values = [1, 3, 5, 10]
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, k_values=k_values)
print(ndcg, _map, recall, precision)
```
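To inspect the rankings for a single query, a small helper can be useful. This sketch assumes the usual BEIR shape in which `results` maps each query ID to a `{doc_id: score}` dictionary; the helper name `top_hits` and the toy IDs and scores are illustrative, not real model output.

```python
def top_hits(results, query_id, k=3):
    """Return the k highest-scoring (doc_id, score) pairs for one query."""
    scored = results.get(query_id, {})
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy stand-in shaped like the `results` object returned by retriever.search.
results = {"q1": {"d7": 0.42, "d2": 0.91, "d5": 0.67, "d9": 0.10}}

for doc_id, score in top_hits(results, "q1", k=2):
    print(doc_id, score)
```

With the real `results`, the returned document IDs can be looked up in `corpus` to read the matching titles and abstracts.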

## 📖 Citation

If you use this model, please cite:

```bibtex
@inproceedings{lan2026biohicl,
  title={BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels},
  author={Lan, Mengfei and Zheng, Lecheng and Kilicoglu, Halil},
  booktitle={ACL 2026},
  year={2026}
}
```