|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- embedding |
|
|
- scientific |
|
|
- abstract |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- microsoft/deberta-base |
|
|
pipeline_tag: feature-extraction |
|
|
--- |
|
|
|
|
|
# InvDef-DeBERTa Model Card |
|
|
|
|
|
InvDef-DeBERTa is a transformer encoder model pretrained for the domain of invasion biology.
|
|
In addition to MLM pretraining on ca. 35,000 scientific abstracts from the domain of invasion biology, we pretrain it as an embedding model on definitions of domain-relevant concepts.
|
|
This dataset of concepts with definitions was created using an LLM by first extracting concepts from the scientific abstracts and then generating definitions for them. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** CLAUSE group at Bielefeld University |
|
|
- **Model type:** DeBERTa-base |
|
|
- **Languages:** Mostly English
|
|
- **Finetuned from model:** [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base) |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [github.com/inas-argumentation/Ontology_Pretraining](https://github.com/inas-argumentation/Ontology_Pretraining) |
|
|
- **Paper:** [aclanthology.org/2025.findings-emnlp.1238/](https://aclanthology.org/2025.findings-emnlp.1238/) |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
A minimal example of how to process texts with this model:
|
|
|
|
|
```python
|
|
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("CLAUSE-Bielefeld/InvDef-DeBERTa")
model = AutoModel.from_pretrained("CLAUSE-Bielefeld/InvDef-DeBERTa")

text = "Your text to be embedded."
batch = tokenizer([text], return_tensors="pt")
model_output = model(**batch)  # last_hidden_state: (batch, seq_len, hidden_size)
|
|
``` |
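
The encoder returns one vector per token. As a minimal sketch, assuming mean pooling over non-padding tokens (the linked repository documents the pooling actually used during training), a single fixed-size embedding can be derived from the output above as follows:

```python
import torch

# Minimal sketch, assuming mean pooling over non-padding tokens;
# see the linked GitHub repo for the pooling used during training.
with torch.no_grad():
    model_output = model(**batch)

mask = batch["attention_mask"].unsqueeze(-1)                 # (batch, seq_len, 1)
summed = (model_output.last_hidden_state * mask).sum(dim=1)  # sum over tokens
embedding = summed / mask.sum(dim=1)                         # (batch, hidden_size)
```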
|
|
|
|
|
## Training Details |
|
|
|
|
|
This model was trained on a dataset of about 35,000 scientific abstracts from the domain of invasion biology.
|
|
Additionally, we used a dataset of 23,597 unique concepts extracted from the abstracts by an LLM, each accompanied by at least four LLM-generated concept definitions. |
|
|
We used a triplet loss to encourage definitions of the same concept to be placed close together in the embedding space, and to also place related concepts (those that co-occur frequently) in proximity.
|
|
The dataset and exact training procedure can be found in our [GitHub repo](https://github.com/inas-argumentation/Ontology_Pretraining).
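
To make the objective concrete, here is an illustrative sketch (not the exact training code, which is in the repository) of a triplet loss over pooled definition embeddings, where `anchor` and `positive` stand in for two definitions of the same concept and `negative` for a definition of a different concept:

```python
import torch

# Hypothetical pooled definition embeddings of shape (batch, hidden_size):
# anchor/positive embed two definitions of the same concept,
# negative embeds a definition of a different concept.
anchor = torch.randn(8, 768)
positive = torch.randn(8, 768)
negative = torch.randn(8, 768)

# Encourage dist(anchor, positive) + margin < dist(anchor, negative).
criterion = torch.nn.TripletMarginLoss(margin=1.0)
loss = criterion(anchor, positive, negative)
```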
|
|
|
|
|
## Evaluation |
|
|
|
|
|
| Model | INAS Clf: Macro F1 | INAS Clf: Micro F1 | INAS Span: Token F1 | INAS Span: Span F1 | EICAT Clf: Macro F1 | EICAT Clf: Micro F1 | EICAT Evidence: NDCG | Avg. | |
|
|
|------------------------------------------------|----------|----------|----------|---------|--------------------|--------------------|-------|-------| |
|
|
| DeBERTa base | 0.674 | 0.745 | 0.406 | 0.218 | 0.392 | 0.416 | 0.505 | 0.483 | |
|
|
| [InvOntDef-DeBERTa](https://huggingface.co/CLAUSE-Bielefeld/InvOntDef-DeBERTa) | **0.750** | **0.812** | 0.414 | **0.242** | **0.504** | **0.518** | **0.530** | **0.538** | |
|
|
| InvDef-DeBERTa | 0.740 | 0.805 | **0.415** | 0.220 | 0.469 | 0.489 | 0.511 | 0.520 | |
|
|
|
|
|
The better-performing [InvOntDef-DeBERTa](https://huggingface.co/CLAUSE-Bielefeld/InvOntDef-DeBERTa) is a companion model we trained using ontology-derived data instead of purely LLM-generated data.
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
**BibTeX:** |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{brinner-etal-2025-enhancing, |
|
|
title = "Enhancing Domain-Specific Encoder Models with {LLM}-Generated Data: How to Leverage Ontologies, and How to Do Without Them", |
|
|
author = "Brinner, Marc Felix and |
|
|
Al Mustafa, Tarek and |
|
|
Zarrie{\ss}, Sina", |
|
|
editor = "Christodoulopoulos, Christos and |
|
|
Chakraborty, Tanmoy and |
|
|
Rose, Carolyn and |
|
|
Peng, Violet", |
|
|
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025", |
|
|
month = nov, |
|
|
year = "2025", |
|
|
address = "Suzhou, China", |
|
|
publisher = "Association for Computational Linguistics", |
|
|
url = "https://aclanthology.org/2025.findings-emnlp.1238/", |
|
|
doi = "10.18653/v1/2025.findings-emnlp.1238", |
|
|
pages = "22740--22754", |
|
|
ISBN = "979-8-89176-335-7" |
|
|
} |
|
|
``` |