---
license: apache-2.0
language: de
library_name: transformers
tags:
- token-classification
- named-entity-recognition
- german
- xlm-roberta
- peft
- lora
---
# 🇩🇪 GermaNER: Adapter-Based NER for German using XLM-RoBERTa
<center><img src="assets/ner_logo.png" alt="NER Logo" width="200" style="margin-bottom:-90px;"/></center>
## 🔍 Overview
**GermaNER** is a Named Entity Recognition (NER) model built on top of `xlm-roberta-large` and fine-tuned with the [PEFT](https://github.com/huggingface/peft) framework using **LoRA adapters**. It predicts seven BIO labels covering three entity types (person, organization, location) and is optimized for both in-domain and general-domain German text.
> This model is lightweight (adapter-only) and requires attaching the LoRA adapter to the base model for inference.
---
## 🧠 Architecture
- **Base model**: [`xlm-roberta-large`](https://huggingface.co/xlm-roberta-large)
- **Fine-tuning**: Parameter-Efficient Fine-Tuning (PEFT) using [LoRA](https://arxiv.org/abs/2106.09685)
- **Adapter config**:
- `r=16`, `alpha=32`, `dropout=0.1`
- LoRA applied to: `query`, `key`, `value` projection layers
- **Max sequence length**: 128 tokens
- **Mixed-precision training**: fp16
- **Training samples**: 44,000 sentences
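The adapter settings listed above can be expressed as a `peft` config. This is a hypothetical reconstruction for illustration: only `r`, `lora_alpha`, `lora_dropout`, and the target modules are stated in this card; any other field is an assumption.

```python
from peft import LoraConfig, TaskType

# Reconstructed from the values in this card; task_type and defaults are assumptions.
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,          # token classification head
    r=16,                                  # LoRA rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],  # attention projections
)
```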
---
## 🏷️ Label Schema
The model uses the standard BIO format with the following 7 labels:
| Label | Description |
|-----------|-----------------------------------|
| `O` | Outside any named entity |
| `B-PER` | Beginning of a person entity |
| `I-PER` | Inside a person entity |
| `B-ORG` | Beginning of an organization |
| `I-ORG` | Inside an organization |
| `B-LOC` | Beginning of a location entity |
| `I-LOC` | Inside a location entity |
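To make the BIO scheme concrete, the sketch below collapses a tag sequence into entity spans. This decoding helper is illustrative, not part of the model's code:

```python
def bio_to_spans(tags):
    """Collapse a BIO tag sequence into (entity_type, start, end) spans,
    with `end` exclusive. Stray I-* tags without a matching B-* are ignored."""
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:            # close the previous span
                spans.append((ent_type, start, i))
            start, ent_type = i, tag[2:]
        elif tag.startswith("I-") and ent_type == tag[2:]:
            continue                         # current span keeps growing
        else:                                # "O" or a mismatched I-*
            if start is not None:
                spans.append((ent_type, start, i))
            start, ent_type = None, None
    if start is not None:                    # flush a span at sentence end
        spans.append((ent_type, start, len(tags)))
    return spans


print(bio_to_spans(["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]))
# → [('PER', 0, 2), ('LOC', 5, 6)]
```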
### 🗂️ Training-Set Concatenation
The model was trained on a **concatenated corpus** of GermEval 2014 and WikiANN-de:
| Split | Sentences |
|-------|-----------|
| **Training** | **44,000** |
| **Evaluation** | **15,100** |
The datasets were token-aligned to the BIO scheme and merged before shuffling, ensuring a balanced distribution of domain-specific (news & Wikipedia) entity mentions across both splits.
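A minimal sketch of the merge-then-shuffle step, using toy stand-ins for the two corpora (the actual data comes from GermEval 2014 and WikiANN-de; the field names here are assumptions):

```python
import random

# Toy stand-ins for the two BIO-aligned corpora.
germeval = [
    {"tokens": ["Siemens", "baut", "in", "München"],
     "tags": ["B-ORG", "O", "O", "B-LOC"]},
]
wikiann_de = [
    {"tokens": ["Goethe", "wurde", "in", "Frankfurt", "geboren"],
     "tags": ["B-PER", "O", "O", "B-LOC", "O"]},
]

# Concatenate, then shuffle deterministically before splitting,
# so both splits mix news and Wikipedia sentences.
merged = germeval + wikiann_de
random.Random(42).shuffle(merged)
```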
## 🚀 Getting Started
This repository contains only the LoRA adapter, not full model weights. Use `peft` to attach the adapter to the base model for inference.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from peft import PeftModel, PeftConfig
model_id = "fau/GermaNER"
# Define label mappings
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: idx for idx, label in enumerate(label_names)}
id2label = {idx: label for idx, label in enumerate(label_names)}
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)
# Load PEFT adapter config
peft_config = PeftConfig.from_pretrained(model_id, token=True)
# Load base model with label mappings
base_model = AutoModelForTokenClassification.from_pretrained(
peft_config.base_model_name_or_path,
num_labels=len(label_names),
id2label=id2label,
label2id=label2id,
token=True
)
# Attach adapter
model = PeftModel.from_pretrained(base_model, model_id, token=True)
# Create pipeline
ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Run inference
text = "Angela Merkel war Bundeskanzlerin von Deutschland."
entities = ner_pipe(text)
for ent in entities:
print(f"{ent['word']} → {ent['entity_group']} (score: {ent['score']:.2f})")
```
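Because training capped sequences at 128 tokens, very long inputs should be chunked before they reach the pipeline. A minimal sliding-window sketch; the 100-word budget and 50-word stride are assumptions that leave headroom for subword expansion:

```python
def window_text(words, max_len=100, stride=50):
    """Split a long word list into overlapping chunks so each chunk
    stays safely under the model's 128-token limit after subword
    tokenization (max_len/stride values here are illustrative)."""
    if len(words) <= max_len:
        return [" ".join(words)]
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + max_len]))
        if start + max_len >= len(words):   # last window reached the end
            break
    return chunks
```

Each chunk can then be passed to `ner_pipe`; when merging results, entities found in the overlap region should be deduplicated.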
## Files & Structure
File | Description
---- | -----------
adapter_model.safetensors | LoRA adapter weights
adapter_config.json | PEFT config for the adapter
tokenizer.json | Tokenizer for XLM-Roberta
sentencepiece.bpe.model | SentencePiece model file
special_tokens_map.json | Special tokens config
tokenizer_config.json | Tokenizer settings
## 💡 Open-Source Use Cases (Hugging Face)
- **Streaming news pipelines** – Deploy `transformers` NER via the `pipeline("ner")` API inside a Kafka → Faust stream-processor. Emit annotated JSON to OpenSearch/Elastic and visualise in Kibana dashboards—all built from OSS components.
- **Parliament analytics** – Load Bundestag & Länder transcripts with `datasets.load_dataset`, tag entities in batch with a `TokenClassificationPipeline`, then export triples to Neo4j via the OSS `graphdatascience` driver and expose them through a GraphQL layer.
- **Biomedical text mining** – Ingest open German clinical-trial registries (e.g. from Hugging Face Hub) into Spark; call the NER model on RDD partitions to extract drug-gene-disease mentions, feeding a downstream pharmacovigilance workflow—entirely with Apache-licensed libraries.
- **Conversational AI** – Attach the LoRA adapter with `PeftModel` and serve the model behind a lightweight HTTP inference endpoint. Connect it to Rasa 3 (open source) as a custom NLU component for real-time slot-filling and context hand-off in German customer-support chatbots.
## 📜 License
This model is licensed under the Apache 2.0 License.
For questions, reach out on GitHub or Hugging Face 🤝
---
Open source contributions are welcome via:
- A `demo.ipynb` notebook
- An evaluation script using `seqeval`
- A `gr.Interface` or Streamlit demo for public inference