|
|
--- |
|
|
license: apache-2.0 |
|
|
language: de |
|
|
library_name: transformers |
|
|
tags: |
|
|
- token-classification |
|
|
- named-entity-recognition |
|
|
- german |
|
|
- xlm-roberta |
|
|
- peft |
|
|
- lora |
|
|
--- |
|
|
|
|
|
# 🇩🇪 GermaNER: Adapter-Based NER for German using XLM-RoBERTa |
|
|
|
|
|
<center><img src="assets/ner_logo.png" alt="NER Logo" width="200" style="margin-bottom:-90px;"/></center> |
|
|
|
|
|
## 🔍 Overview |
|
|
|
|
|
**GermaNER** is a high-performance Named Entity Recognition (NER) model built on top of `xlm-roberta-large` and fine-tuned using the [PEFT](https://github.com/huggingface/peft) framework with **LoRA adapters**. It predicts 7 BIO labels covering three entity types (person, organization, location) and is optimized for both in-domain (news & Wikipedia) and general-domain German text.
|
|
|
|
|
> This model is lightweight (adapter-only) and requires attaching the LoRA adapter to the base model for inference. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Architecture |
|
|
|
|
|
- **Base model**: [`xlm-roberta-large`](https://huggingface.co/xlm-roberta-large) |
|
|
- **Fine-tuning**: Parameter-Efficient Fine-Tuning (PEFT) using [LoRA](https://arxiv.org/abs/2106.09685) |
|
|
- **Adapter config**: |
|
|
- `r=16`, `alpha=32`, `dropout=0.1` |
|
|
  - LoRA applied to the `query`, `key`, and `value` projection layers (see the config sketch after this list)
|
|
- **Max sequence length**: 128 tokens |
|
|
- **Mixed-precision training**: fp16
|
|
- **Training samples**: 44,000 sentences |
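
The adapter configuration above can be reproduced with PEFT roughly as follows. This is a minimal sketch assuming the token-classification task type, not the original training script:

```python
from peft import LoraConfig, TaskType

# Minimal sketch of the LoRA setup described above (an assumption, not the
# original training script): rank 16, alpha 32, dropout 0.1, applied to the
# query/key/value projections of xlm-roberta-large.
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],
)
# Attach to a loaded base model with: get_peft_model(base_model, lora_config)
```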
|
|
|
|
|
--- |
|
|
|
|
|
## 🏷️ Label Schema |
|
|
|
|
|
The model uses the standard BIO format with the following 7 labels: |
|
|
|
|
|
| Label | Description |
|-----------|-----------------------------------|
| `O` | Outside any named entity |
| `B-PER` | Beginning of a person entity |
| `I-PER` | Inside a person entity |
| `B-ORG` | Beginning of an organization |
| `I-ORG` | Inside an organization |
| `B-LOC` | Beginning of a location entity |
| `I-LOC` | Inside a location entity |
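
A hand-tagged illustration of the scheme (not model output):

```python
# Hand-tagged illustration of the BIO scheme (not actual model output):
tokens = ["Angela", "Merkel", "war", "Bundeskanzlerin", "von", "Deutschland", "."]
tags   = ["B-PER",  "I-PER",  "O",   "O",               "O",   "B-LOC",       "O"]
```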
|
|
|
|
|
### 🗂️ Training-Set Concatenation |
|
|
The model was trained on a **concatenated corpus** of GermEval 2014 and WikiANN-de: |
|
|
|
|
|
| Split | Sentences |
|-------|-----------|
| **Training** | **44,000** |
| **Evaluation** | **15,100** |
|
|
|
|
|
The datasets were token-aligned to the BIO scheme and merged before shuffling, ensuring a balanced distribution of domain-specific (news & Wikipedia) entity mentions across both splits. |
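
A toy sketch of the merge-and-shuffle step with 🤗 `datasets`; the real pipeline first harmonizes both corpora's label sets to the shared 7-label scheme:

```python
from datasets import Dataset, concatenate_datasets

# Toy illustration of the merge-and-shuffle step. The real corpora are
# GermEval 2014 and WikiANN-de, token-aligned to the shared BIO labels first.
germeval = Dataset.from_dict({"tokens": [["Berlin"]], "ner_tags": [[5]]})   # 5 = B-LOC
wikiann  = Dataset.from_dict({"tokens": [["Siemens"]], "ner_tags": [[3]]})  # 3 = B-ORG

merged = concatenate_datasets([germeval, wikiann]).shuffle(seed=42)
print(merged[:])
```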
|
|
|
|
|
## 🚀 Getting Started |
|
|
|
|
|
This repository ships **adapter weights only**, not a full fine-tuned model. Use `peft` to attach the adapter to the base model for inference.
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from peft import PeftModel, PeftConfig

model_id = "fau/GermaNER"

# Define label mappings
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: idx for idx, label in enumerate(label_names)}
id2label = {idx: label for idx, label in enumerate(label_names)}

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)

# Load PEFT adapter config
peft_config = PeftConfig.from_pretrained(model_id, token=True)

# Load base model with label mappings
base_model = AutoModelForTokenClassification.from_pretrained(
    peft_config.base_model_name_or_path,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
    token=True,
)

# Attach adapter
model = PeftModel.from_pretrained(base_model, model_id, token=True)

# Create pipeline
ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Run inference
text = "Angela Merkel war Bundeskanzlerin von Deutschland."
entities = ner_pipe(text)

for ent in entities:
    print(f"{ent['word']} → {ent['entity_group']} (score: {ent['score']:.2f})")
```
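
For the sample sentence, the pipeline should group `Angela Merkel` as `PER` and `Deutschland` as `LOC`, each with a confidence score.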
|
|
|
|
|
## 📁 Files & Structure

| File | Description |
|------|-------------|
| `adapter_model.safetensors` | LoRA adapter weights |
| `adapter_config.json` | PEFT config for the adapter |
| `tokenizer.json` | Tokenizer for XLM-RoBERTa |
| `sentencepiece.bpe.model` | SentencePiece model file |
| `special_tokens_map.json` | Special tokens config |
| `tokenizer_config.json` | Tokenizer settings |
|
|
|
|
|
## 💡 Open-Source Use Cases (Hugging Face) |
|
|
|
|
|
- **Streaming news pipelines** – Deploy `transformers` NER via the `pipeline("ner")` API inside a Kafka → Faust stream-processor. Emit annotated JSON to OpenSearch/Elastic and visualise in Kibana dashboards—all built from OSS components. |
|
|
|
|
|
- **Parliament analytics** – Load Bundestag & Länder transcripts with `datasets.load_dataset`, tag entities in batch with a `TokenClassificationPipeline` (see the sketch after this list), then export triples to Neo4j via the official open-source `neo4j` Python driver and expose them through a GraphQL layer.
|
|
|
|
|
- **Biomedical text mining** – Ingest open German clinical-trial registries (e.g. from Hugging Face Hub) into Spark; call the NER model on RDD partitions to extract drug-gene-disease mentions, feeding a downstream pharmacovigilance workflow—entirely with Apache-licensed libraries. |
|
|
|
|
|
- **Conversational AI** – Attach the LoRA adapter with `PeftModel` and serve it behind a lightweight open-source REST wrapper (e.g. FastAPI). Connect the endpoint to Rasa 3 (open source) through a custom NLU component for real-time slot-filling and context hand-off in German customer-support chatbots.
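
All four pipelines share the same core step: batch-tagging German text. A minimal sketch, assuming `ner_pipe` from the Getting Started section and a hypothetical local `bundestag.txt` transcript file (one sentence per line):

```python
from datasets import load_dataset

# Batch-tag transcripts with the pipeline built in Getting Started.
# "bundestag.txt" is a hypothetical local file, one sentence per line.
transcripts = load_dataset("text", data_files={"train": "bundestag.txt"}, split="train")

for batch in transcripts.iter(batch_size=32):
    for sentence, entities in zip(batch["text"], ner_pipe(batch["text"])):
        print(sentence, "→", [(e["word"], e["entity_group"]) for e in entities])
```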
|
|
|
|
|
|
|
|
## 📜 License
|
|
This model is licensed under the Apache 2.0 License. |
|
|
|
|
|
For questions, reach out on GitHub or Hugging Face 🤝 |
|
|
--- |
|
|
|
|
|
Open-source contributions are welcome via:
|
|
- A `demo.ipynb` notebook |
|
|
- An evaluation script using `seqeval` (a starter snippet follows this list)
|
|
- A `gr.Interface` or Streamlit demo for public inference |
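
As a starting point for the `seqeval` evaluation script, a tiny self-contained example (`y_true`/`y_pred` are per-sentence BIO tag sequences):

```python
from seqeval.metrics import classification_report

# Toy inputs: per-sentence lists of BIO tags (gold vs. predicted).
y_true = [["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O", "O", "O"]]

print(classification_report(y_true, y_pred))
```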