LunaLan07 committed on
Commit 9ca21f2 · verified · 1 Parent(s): fba173e

Update README.md

Files changed (1)
  1. README.md +38 -52
README.md CHANGED
@@ -1,49 +1,51 @@
  # BioHiCL-base: Hierarchical Multi-Label Contrastive Biomedical Retriever

  ## 🔍 Overview
- BioHiCL-base is a biomedical dense retriever trained with hierarchical MeSH supervision to capture fine-grained semantic relationships between biomedical texts.

- Unlike traditional dense retrievers trained with binary relevance signals, BioHiCL models semantic similarity using structured multi-label supervision derived from the MeSH ontology.

  ---

  ## 💡 Key Features
- - **Hierarchical supervision**: Uses MeSH ontology to model semantic relationships
- - **Multi-label similarity learning**: Captures partial semantic overlap between documents
- - **Contrastive + regression training**: Aligns embedding similarity with label similarity
- - **Efficient**: ~0.1B parameters, suitable for deployment on a single GPU

  ---

  ## 🧠 Model Details
- - **Model type**: Bi-encoder (dense retriever)
- - **Backbone**: BAAI/bge-base-en-v1.5
- - **Parameters**: ~0.1B
- - **Fine-tuning**: LoRA (merged into base model)
- - **Max input length**: 8192 tokens

  ---

  ## ⚙️ How It Works
  BioHiCL aligns:
- - **Embedding similarity (SimE)**: cosine similarity between embeddings
- - **Label similarity (SimL)**: cosine similarity over weighted MeSH labels

- Training objective:
- - MSE loss to align SimE with SimL
- - Hierarchical contrastive loss to separate unrelated documents

  ---

- ## 🚀 Usage - Text Similarity

  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch
  import torch.nn.functional as F

- tokenizer = AutoTokenizer.from_pretrained("LunaLan07/BioHiCL-Large")
- model = AutoModel.from_pretrained("LunaLan07/BioHiCL-Large")

  def encode(texts):
      inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
@@ -60,41 +62,25 @@ print(similarity)



- ---
-
- ## 🚀 Usage - Evaluation on BEIR Benchmark
-
  ```python
- from beir import util
- from beir.datasets.data_loader import GenericDataLoader
- from beir.retrieval.models import SentenceBERT
- from beir.retrieval.search.dense import DenseRetrievalExactSearch
- from beir.retrieval.evaluation import EvaluateRetrieval
-
- dataset = "scifact"
- url = ...
- data_path = util.download_and_unzip(url, "datasets")
- corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")
-
- model_name = "LunaLan07/BioHiCL-Large"
- model = SentenceBERT(model_name)
- retriever = DenseRetrievalExactSearch(model, batch_size=16)
- top_k = 10 # top 10 documents per query
- results = retriever.search(corpus, queries, top_k=top_k, score_function="cos_sim")
-
- k_values = [1, 3, 5, 10]
- ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, k_values=k_values)

- ---

- ## 📖 Citation
- If you use this model, please cite:

- ```bibtex
- @article{lan2026biohicl,
- title={BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels},
- author={Lan, Mengfei, Zheng, Lecheng, and Kilicoglu, Halil},
- journal={ACL 2026},
- year={2026}
- }

@@ -1,49 +1,51 @@
  # BioHiCL-base: Hierarchical Multi-Label Contrastive Biomedical Retriever

  ## 🔍 Overview
+ BioHiCL-base is a biomedical dense retriever trained with hierarchical MeSH supervision to capture fine-grained semantic relationships between biomedical texts.

+ Unlike traditional dense retrievers trained with binary relevance signals, BioHiCL models semantic similarity using structured multi-label supervision derived from the MeSH ontology, enabling it to capture partial semantic overlap between documents.

  ---

  ## 💡 Key Features
+ - **Hierarchical supervision**: Leverages MeSH ontology to encode structured biomedical semantics
+ - **Multi-label similarity learning**: Captures graded semantic overlap beyond binary relevance
+ - **Contrastive + regression training**: Aligns embedding similarity with label similarity
+ - **Efficient**: ~0.1B parameters, suitable for deployment on a single GPU

  ---

  ## 🧠 Model Details
+ - **Model type**: Bi-encoder (dense retriever)
+ - **Backbone**: `BAAI/bge-base-en-v1.5`
+ - **Parameters**: ~0.1B
+ - **Fine-tuning**: LoRA (merged into base model)
+ - **Max input length**: 8192 tokens

  ---

  ## ⚙️ How It Works
  BioHiCL aligns:
+ - **Embedding similarity (SimE)**: cosine similarity between embeddings
+ - **Label similarity (SimL)**: cosine similarity over weighted MeSH label vectors

+ ### Training Objective
+ - Mean Squared Error (MSE) loss to align SimE with SimL
+ - Hierarchical contrastive loss to separate unrelated documents and prevent embedding collapse

  ---

+ ## 🚀 Usage — Text Similarity

  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch
  import torch.nn.functional as F

+ model_name = "LunaLan07/BioHiCL-base"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModel.from_pretrained(model_name)

  def encode(texts):
      inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

@@ -60,41 +62,25 @@ print(similarity)



  ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+ import torch.nn.functional as F

+ model_name = "LunaLan07/BioHiCL-base"

+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModel.from_pretrained(model_name)

+ def encode(texts):
+     inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
+     outputs = model(**inputs)
+     embeddings = outputs.last_hidden_state[:, 0]  # CLS token
+     return F.normalize(embeddings, p=2, dim=1)

+ # Example
+ query = encode(["What are treatments for COPD?"])
+ doc = encode(["Chronic obstructive pulmonary disease is treated with bronchodilators."])
+
+ similarity = (query @ doc.T).item()
+ print(similarity)
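
The "How It Works" section of the README in this commit describes aligning embedding similarity (SimE) with label similarity (SimL) through an MSE loss. The following is a minimal plain-Python sketch of that alignment term only, not the authors' implementation. The label names, the label weights, and the toy 2-D embeddings are all hypothetical; the README does not say how MeSH labels are weighted, and the hierarchical contrastive loss is not sketched here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {key: weight} dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def alignment_mse(embeddings, label_vectors):
    """Mean squared error between SimE and SimL over all unordered document pairs."""
    squared_errors = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            # SimE: cosine similarity between the two dense embeddings.
            sim_e = cosine(dict(enumerate(embeddings[i])),
                           dict(enumerate(embeddings[j])))
            # SimL: cosine similarity between the two weighted label vectors.
            sim_l = cosine(label_vectors[i], label_vectors[j])
            squared_errors.append((sim_e - sim_l) ** 2)
    if not squared_errors:
        return 0.0  # fewer than two documents: nothing to align
    return sum(squared_errors) / len(squared_errors)


# Hypothetical label vectors: placeholder label names with made-up weights.
# A shared parent label gets a lower weight than a specific label (assumption).
doc_a = {"lung_diseases": 0.5, "copd": 1.0}
doc_b = {"lung_diseases": 0.5, "asthma": 1.0}
doc_c = {"neoplasms": 1.0}

# Toy 2-D embeddings standing in for model outputs.
emb_a, emb_b, emb_c = [1.0, 0.0], [0.6, 0.8], [0.0, 1.0]

print(cosine(doc_a, doc_b))  # partial overlap through the shared parent label
print(cosine(doc_a, doc_c))  # no shared labels
print(alignment_mse([emb_a, emb_b, emb_c], [doc_a, doc_b, doc_c]))
```

In this toy setup the first two documents share only a down-weighted parent label, so their SimL is small but non-zero. This is the graded overlap that a binary relevance signal would not express, and the MSE term penalizes embeddings whose cosine similarity departs from it.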