---
library_name: transformers
tags:
- embedding
- scientific
- abstract
license: mit
language:
- en
base_model:
- microsoft/deberta-base
pipeline_tag: feature-extraction
---

# InvDef-DeBERTa Model Card

InvDef-DeBERTa is a transformer encoder model pretrained for the domain of invasion biology.
In addition to MLM pretraining on roughly 35,000 scientific abstracts from the domain, we pretrain it as an embedding model on definitions of domain-relevant concepts.
This dataset of concepts with definitions was created with an LLM by first extracting concepts from the scientific abstracts and then generating definitions for them.

## Model Details

### Model Description

- **Developed by:** CLAUSE group at Bielefeld University
- **Model type:** DeBERTa-base
- **Languages:** Mostly English
- **Finetuned from model:** [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base)

### Model Sources

- **Repository:** [github.com/inas-argumentation/Ontology_Pretraining](https://github.com/inas-argumentation/Ontology_Pretraining)
- **Paper:** [aclanthology.org/2025.findings-emnlp.1238/](https://aclanthology.org/2025.findings-emnlp.1238/)

## How to Get Started with the Model

A minimal example of how to process texts with this model:

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("CLAUSE-Bielefeld/InvDef-DeBERTa")
model = AutoModel.from_pretrained("CLAUSE-Bielefeld/InvDef-DeBERTa")

# Tokenize the input and run it through the encoder.
text = "Your text to be embedded."
batch = tokenizer([text], return_tensors="pt")
model_output = model(**batch)  # model_output.last_hidden_state holds the per-token embeddings
```
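
The snippet above returns per-token hidden states; to get a single vector per text you still need a pooling step. The model card does not specify one, so the sketch below uses attention-masked mean pooling, a common choice for encoder-based embedding models. The `embed` helper and the pooling choice are assumptions for illustration, not necessarily what was used during pretraining; check the GitHub repo for the exact procedure.

```python
import torch

# Reuses `tokenizer` and `model` loaded in the snippet above.
# NOTE: mean pooling is an assumption; the training code may use a different pooling.
def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                      # sum over non-padding tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)                 # number of non-padding tokens
    return summed / counts                                   # (batch, dim)

embeddings = embed(["Invasive species alter ecosystem functioning.",
                    "Biological invasions impact native biodiversity."])
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
```

Embeddings obtained this way can be compared with cosine similarity, as in the last line, for retrieval or clustering of domain texts.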

## Training Details

This model was trained on a dataset of about 35,000 scientific abstracts from the domain of invasion biology.
Additionally, we used a dataset of 23,597 unique concepts extracted from the abstracts by an LLM, each accompanied by at least four LLM-generated concept definitions.
We used a triplet loss to encourage definitions of the same concept to be placed close together in the embedding space, and to also place related concepts (those that co-occur frequently) in proximity.
The dataset and exact training procedure can be found in our [GitHub repo](https://github.com/inas-argumentation/Ontology_Pretraining).
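
For illustration, the sketch below shows a margin-based triplet objective of this kind: an anchor definition is pulled towards another definition of the same concept and pushed away from a definition of a different concept. The function name, the margin value, and the pairing strategy are placeholder assumptions; the exact loss and sampling strategy are in the GitHub repo.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss on definition embeddings.

    anchor, positive: embeddings of two definitions of the same concept
    negative:         embedding of a definition of a different concept
    (margin=1.0 is an assumed placeholder, not the value used in training)
    """
    pos_dist = F.pairwise_distance(anchor, positive)   # distance to matching definition
    neg_dist = F.pairwise_distance(anchor, negative)   # distance to non-matching definition
    return F.relu(pos_dist - neg_dist + margin).mean()
```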

## Evaluation

| Model                                          | INAS Clf: Macro F1 | INAS Clf: Micro F1 | INAS Span: Token F1 | INAS Span: Span F1 | EICAT Clf: Macro F1 | EICAT Clf: Micro F1 | EICAT Evidence: NDCG  | Avg.  |
|------------------------------------------------|----------|----------|----------|---------|--------------------|--------------------|-------|-------|
| DeBERTa base                                   | 0.674    | 0.745    | 0.406    | 0.218   | 0.392              | 0.416              | 0.505 | 0.483 |
| [InvOntDef-DeBERTa](https://huggingface.co/CLAUSE-Bielefeld/InvOntDef-DeBERTa) | **0.750**    | **0.812**    | 0.414    | **0.242**   | **0.504**              | **0.518**              | **0.530** | **0.538** |
| InvDef-DeBERTa | 0.740    | 0.805    | **0.415**    | 0.220   | 0.469              | 0.489              | 0.511 | 0.520 |

The better-performing [InvOntDef-DeBERTa](https://huggingface.co/CLAUSE-Bielefeld/InvOntDef-DeBERTa) was also trained by us, using ontology-derived data instead of purely LLM-generated data.


## Citation

**BibTeX:**

```bibtex
@inproceedings{brinner-etal-2025-enhancing,
    title = "Enhancing Domain-Specific Encoder Models with {LLM}-Generated Data: How to Leverage Ontologies, and How to Do Without Them",
    author = "Brinner, Marc Felix  and
      Al Mustafa, Tarek  and
      Zarrie{\ss}, Sina",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1238/",
    doi = "10.18653/v1/2025.findings-emnlp.1238",
    pages = "22740--22754",
    ISBN = "979-8-89176-335-7"
}
```