|
|
--- |
|
|
license: apache-2.0 |
|
|
language: de |
|
|
library_name: transformers |
|
|
tags: |
|
|
- token-classification |
|
|
- named-entity-recognition |
|
|
- german |
|
|
- xlm-roberta |
|
|
- peft |
|
|
- lora |
|
|
--- |
|
|
|
|
|
# 🇩🇪 GermaNER: Adapter-Based NER for German using XLM-RoBERTa |
|
|
|
|
|
<center><img src="assets/ner_logo.png" alt="NER Logo" width="200" style="margin-bottom:-90px;"/></center> |
|
|
|
|
|
## 🔍 Overview |
|
|
|
|
|
**GermaNER** is a high-performance Named Entity Recognition (NER) model built on top of `xlm-roberta-large` and fine-tuned using the [PEFT](https://github.com/huggingface/peft) framework with **LoRA adapters**. It predicts 7 BIO labels covering three entity types (person, organization, location) and is optimized for both in-domain (news & Wikipedia) and general-domain German text.
|
|
|
|
|
> This model is lightweight (adapter-only) and requires attaching the LoRA adapter to the base model for inference. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Architecture |
|
|
|
|
|
- **Base model**: [`xlm-roberta-large`](https://huggingface.co/xlm-roberta-large) |
|
|
- **Fine-tuning**: Parameter-Efficient Fine-Tuning (PEFT) using [LoRA](https://arxiv.org/abs/2106.09685) |
|
|
- **Adapter config**: |
|
|
- `r=16`, `alpha=32`, `dropout=0.1` |
|
|
  - LoRA applied to the `query`, `key`, and `value` projection layers (see the config sketch after this list)
|
|
- **Max sequence length**: 128 tokens |
|
|
- **Mixed-precision training**: fp16
|
|
- **Training samples**: 44,000 sentences |
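
The adapter configuration above can be reproduced with PEFT roughly as follows. This is a minimal sketch assuming the token-classification task type, not the original training script:

```python
from peft import LoraConfig, TaskType

# Minimal sketch of the LoRA setup described above (an assumption, not the
# original training script): rank 16, alpha 32, dropout 0.1, applied to the
# query/key/value projections of xlm-roberta-large.
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],
)
# Attach to a loaded base model with: get_peft_model(base_model, lora_config)
```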
|
|
|
|
|
--- |
|
|
|
|
|
## 🏷️ Label Schema |
|
|
|
|
|
The model uses the standard BIO format with the following 7 labels: |
|
|
|
|
|
| Label | Description |
|-----------|-----------------------------------|
| `O` | Outside any named entity |
| `B-PER` | Beginning of a person entity |
| `I-PER` | Inside a person entity |
| `B-ORG` | Beginning of an organization |
| `I-ORG` | Inside an organization |
| `B-LOC` | Beginning of a location entity |
| `I-LOC` | Inside a location entity |
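
A hand-tagged illustration of the scheme (not model output):

```python
# Hand-tagged illustration of the BIO scheme (not actual model output):
tokens = ["Angela", "Merkel", "war", "Bundeskanzlerin", "von", "Deutschland", "."]
tags   = ["B-PER",  "I-PER",  "O",   "O",               "O",   "B-LOC",       "O"]
```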
|
|
|
|
|
### 🗂️ Training-Set Concatenation |
|
|
The model was trained on a **concatenated corpus** of GermEval 2014 and WikiANN-de: |
|
|
|
|
|
| Split | Sentences |
|-------|-----------|
| **Training** | **44,000** |
| **Evaluation** | **15,100** |
|
|
|
|
|
The datasets were token-aligned to the BIO scheme and merged before shuffling, ensuring a balanced distribution of domain-specific (news & Wikipedia) entity mentions across both splits. |
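
A toy sketch of the merge-and-shuffle step with 🤗 `datasets`; the real pipeline first harmonizes both corpora's label sets to the shared 7-label scheme:

```python
from datasets import Dataset, concatenate_datasets

# Toy illustration of the merge-and-shuffle step. The real corpora are
# GermEval 2014 and WikiANN-de, token-aligned to the shared BIO labels first.
germeval = Dataset.from_dict({"tokens": [["Berlin"]], "ner_tags": [[5]]})   # 5 = B-LOC
wikiann  = Dataset.from_dict({"tokens": [["Siemens"]], "ner_tags": [[3]]})  # 3 = B-ORG

merged = concatenate_datasets([germeval, wikiann]).shuffle(seed=42)
print(merged[:])
```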
|
|
|
|
|
## 🚀 Getting Started |
|
|
|
|
|
This repository ships **adapter weights only**, not a full fine-tuned model. Use `peft` to attach the adapter to the base model for inference.
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from peft import PeftModel, PeftConfig

model_id = "fau/GermaNER"

# Define label mappings
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: idx for idx, label in enumerate(label_names)}
id2label = {idx: label for idx, label in enumerate(label_names)}

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)

# Load PEFT adapter config
peft_config = PeftConfig.from_pretrained(model_id, token=True)

# Load base model with label mappings
base_model = AutoModelForTokenClassification.from_pretrained(
    peft_config.base_model_name_or_path,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
    token=True,
)

# Attach adapter
model = PeftModel.from_pretrained(base_model, model_id, token=True)

# Create pipeline
ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Run inference
text = "Angela Merkel war Bundeskanzlerin von Deutschland."
entities = ner_pipe(text)

for ent in entities:
    print(f"{ent['word']} → {ent['entity_group']} (score: {ent['score']:.2f})")
```
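
For the sample sentence, the pipeline should group `Angela Merkel` as `PER` and `Deutschland` as `LOC`, each with a confidence score.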
|
|
|
|
|
## 📁 Files & Structure

| File | Description |
|------|-------------|
| `adapter_model.safetensors` | LoRA adapter weights |
| `adapter_config.json` | PEFT config for the adapter |
| `tokenizer.json` | Tokenizer for XLM-RoBERTa |
| `sentencepiece.bpe.model` | SentencePiece model file |
| `special_tokens_map.json` | Special tokens config |
| `tokenizer_config.json` | Tokenizer settings |
|
|
|
|
|
## 💡 Open-Source Use Cases (Hugging Face) |
|
|
|
|
|
- **Streaming news pipelines** – Deploy `transformers` NER via the `pipeline("ner")` API inside a Kafka → Faust stream-processor. Emit annotated JSON to OpenSearch/Elastic and visualise in Kibana dashboards—all built from OSS components. |
|
|
|
|
|
- **Parliament analytics** – Load Bundestag & Länder transcripts with `datasets.load_dataset`, tag entities in batch with a `TokenClassificationPipeline` (see the sketch after this list), then export triples to Neo4j via the official open-source `neo4j` Python driver and expose them through a GraphQL layer.
|
|
|
|
|
- **Biomedical text mining** – Ingest open German clinical-trial registries (e.g. from Hugging Face Hub) into Spark; call the NER model on RDD partitions to extract drug-gene-disease mentions, feeding a downstream pharmacovigilance workflow—entirely with Apache-licensed libraries. |
|
|
|
|
|
- **Conversational AI** – Attach the LoRA adapter with `PeftModel` and serve it behind a lightweight open-source REST wrapper (e.g. FastAPI). Connect the endpoint to Rasa 3 (open source) through a custom NLU component for real-time slot-filling and context hand-off in German customer-support chatbots.
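
All four pipelines share the same core step: batch-tagging German text. A minimal sketch, assuming `ner_pipe` from the Getting Started section and a hypothetical local `bundestag.txt` transcript file (one sentence per line):

```python
from datasets import load_dataset

# Batch-tag transcripts with the pipeline built in Getting Started.
# "bundestag.txt" is a hypothetical local file, one sentence per line.
transcripts = load_dataset("text", data_files={"train": "bundestag.txt"}, split="train")

for batch in transcripts.iter(batch_size=32):
    for sentence, entities in zip(batch["text"], ner_pipe(batch["text"])):
        print(sentence, "→", [(e["word"], e["entity_group"]) for e in entities])
```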
|
|
|
|
|
|
|
|
## 📜 License
|
|
This model is licensed under the Apache 2.0 License. |
|
|
|
|
|
For questions, reach out on GitHub or Hugging Face 🤝 |
|
|
--- |
|
|
|
|
|
Open-source contributions are welcome via:
|
|
- A `demo.ipynb` notebook |
|
|
- An evaluation script using `seqeval` (a starter snippet follows this list)
|
|
- A `gr.Interface` or Streamlit demo for public inference |
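
As a starting point for the `seqeval` evaluation script, a tiny self-contained example (`y_true`/`y_pred` are per-sentence BIO tag sequences):

```python
from seqeval.metrics import classification_report

# Toy inputs: per-sentence lists of BIO tags (gold vs. predicted).
y_true = [["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O", "O", "O"]]

print(classification_report(y_true, y_pred))
```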