---
license: apache-2.0
language: de
library_name: transformers
tags:
  - token-classification
  - named-entity-recognition
  - german
  - xlm-roberta
  - peft
  - lora
---

# 🇩🇪 GermaNER: Adapter-Based NER for German using XLM-RoBERTa

<center><img src="assets/ner_logo.png" alt="NER Logo" width="200" style="margin-bottom:-90px;"/></center>

## 🔍 Overview

**GermaNER** is a high-performance Named Entity Recognition (NER) model built on top of `xlm-roberta-large` and fine-tuned with the [PEFT](https://github.com/huggingface/peft) framework using **LoRA adapters**. It predicts 7 BIO tags covering three entity types (person, organization, location) and is optimized for both in-domain and general-domain German text.

> This model is lightweight (adapter-only) and requires attaching the LoRA adapter to the base model for inference.

---

## 🧠 Architecture

- **Base model**: [`xlm-roberta-large`](https://huggingface.co/xlm-roberta-large)
- **Fine-tuning**: Parameter-Efficient Fine-Tuning (PEFT) using [LoRA](https://arxiv.org/abs/2106.09685)
- **Adapter config**:
  - `r=16`, `alpha=32`, `dropout=0.1`
  - LoRA applied to: `query`, `key`, `value` projection layers
- **Max sequence length**: 128 tokens
- **Mixed-precision training**: fp16
- **Training samples**: 44,000 sentences
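The adapter configuration above corresponds roughly to the following PEFT setup. This is a sketch of the hyperparameters reported in this card; the exact `target_modules` strings are an assumption based on the attention projection layer names in `xlm-roberta-large`:

```python
from peft import LoraConfig, TaskType

# Hyperparameters as reported in this card; target_modules names are
# assumptions matching XLM-RoBERTa's attention projections.
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],
)
```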

---

## 🏷️ Label Schema

The model uses the standard BIO format with the following 7 labels:

| Label     | Description                       |
|-----------|-----------------------------------|
| `O`       | Outside any named entity          |
| `B-PER`   | Beginning of a person entity      |
| `I-PER`   | Inside a person entity            |
| `B-ORG`   | Beginning of an organization      |
| `I-ORG`   | Inside an organization            |
| `B-LOC`   | Beginning of a location entity    |
| `I-LOC`   | Inside a location entity          |
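For reference, BIO tags can be decoded into entity spans with a few lines of plain Python. This is a generic sketch, independent of this model's tokenizer or pipeline:

```python
def bio_to_spans(tokens, tags):
    """Group parallel token/BIO-tag lists into (entity_type, tokens) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag without a matching B- (treated as outside)
            if current_type:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_type:
        spans.append((current_type, current_tokens))
    return spans

tokens = ["Angela", "Merkel", "besuchte", "Berlin", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
# → [('PER', ['Angela', 'Merkel']), ('LOC', ['Berlin'])]
```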

### 🗂️ Training-Set Concatenation  
The model was trained on a **concatenated corpus** of GermEval 2014 and WikiANN-de:

| Split | Sentences |
|-------|-----------|
| **Training** | **44,000** |
| **Evaluation** | **15,100** |

The datasets were token-aligned to the BIO scheme and merged before shuffling, ensuring a balanced distribution of domain-specific (news & Wikipedia) entity mentions across both splits.
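The merge-then-shuffle step described above can be sketched in plain Python. The toy sentences below are hypothetical stand-ins for the real GermEval 2014 and WikiANN-de examples; the actual pipeline would use the `datasets` library:

```python
import random

# Hypothetical, BIO-aligned stand-ins for the two corpora.
germeval = [
    (["Angela", "Merkel", "spricht", "."], ["B-PER", "I-PER", "O", "O"]),
    (["Siemens", "investiert", "."], ["B-ORG", "O", "O"]),
]
wikiann_de = [
    (["Berlin", "ist", "gross", "."], ["B-LOC", "O", "O", "O"]),
]

# Concatenate first, then shuffle, so both domains mix across splits.
merged = germeval + wikiann_de
random.seed(42)
random.shuffle(merged)

# Sanity check: every sentence keeps its token/tag alignment after merging.
for tokens, tags in merged:
    assert len(tokens) == len(tags)
```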

## 🚀 Getting Started

This repository ships **adapter weights only**, not a full fine-tuned model. Use `peft` to attach the adapter to the base model before inference.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from peft import PeftModel, PeftConfig

model_id = "fau/GermaNER"

# Define label mappings
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: idx for idx, label in enumerate(label_names)}
id2label = {idx: label for idx, label in enumerate(label_names)}

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)

# Load PEFT adapter config
peft_config = PeftConfig.from_pretrained(model_id, token=True)

# Load base model with label mappings
base_model = AutoModelForTokenClassification.from_pretrained(
    peft_config.base_model_name_or_path,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
    token=True
)

# Attach adapter
model = PeftModel.from_pretrained(base_model, model_id, token=True)

# Create pipeline
ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Run inference
text = "Angela Merkel war Bundeskanzlerin von Deutschland."
entities = ner_pipe(text)

for ent in entities:
    print(f"{ent['word']} → {ent['entity_group']} (score: {ent['score']:.2f})")
```

## 📁 Files & Structure

| File | Description |
|------|-------------|
| `adapter_model.safetensors` | LoRA adapter weights |
| `adapter_config.json` | PEFT config for the adapter |
| `tokenizer.json` | Tokenizer for XLM-RoBERTa |
| `sentencepiece.bpe.model` | SentencePiece model file |
| `special_tokens_map.json` | Special tokens config |
| `tokenizer_config.json` | Tokenizer settings |

## 💡 Open-Source Use Cases (Hugging Face)

- **Streaming news pipelines** – Deploy `transformers` NER via the `pipeline("ner")` API inside a Kafka → Faust stream-processor. Emit annotated JSON to OpenSearch/Elastic and visualise in Kibana dashboards—all built from OSS components.

- **Parliament analytics** – Load Bundestag & Länder transcripts with `datasets.load_dataset`, tag entities in batch with a `TokenClassificationPipeline`, then export triples to Neo4j via the OSS `graphdatascience` driver and expose them through a GraphQL layer.

- **Biomedical text mining** – Ingest open German clinical-trial registries (e.g. from the Hugging Face Hub) into Spark; call the NER model on partitions to extract entity mentions (after further fine-tuning on biomedical labels, since this model covers only PER/ORG/LOC), feeding a downstream pharmacovigilance workflow built entirely on Apache-licensed libraries.

- **Conversational AI** – Attach the LoRA adapter with `PeftModel` and serve it behind a lightweight HTTP endpoint (e.g. a FastAPI wrapper around the pipeline). Connect it to Rasa (open source) via a custom NLU component for real-time slot-filling and context hand-off in German customer-support chatbots.
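As a small illustration of the analytics use cases above, aggregated entity counts can be computed from pipeline output with plain Python. The `entities` list below is hypothetical example output in the format produced by `aggregation_strategy="simple"`:

```python
from collections import Counter

# Hypothetical pipeline output collected over a batch of documents.
entities = [
    {"word": "Angela Merkel", "entity_group": "PER", "score": 0.99},
    {"word": "Berlin", "entity_group": "LOC", "score": 0.98},
    {"word": "Berlin", "entity_group": "LOC", "score": 0.97},
    {"word": "Siemens", "entity_group": "ORG", "score": 0.95},
]

# Count mentions per (entity text, type) pair for a dashboard or graph export.
mention_counts = Counter((e["word"], e["entity_group"]) for e in entities)
print(mention_counts.most_common(1))
# → [(('Berlin', 'LOC'), 2)]
```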


## 📜 License

This model is licensed under the Apache License 2.0.

For questions, reach out on GitHub or Hugging Face 🤝

---

## 🤝 Contributing

Open-source contributions are welcome via:
- A `demo.ipynb` notebook
- An evaluation script using `seqeval`
- A `gr.Interface` or Streamlit demo for public inference