---
license: apache-2.0
language: de
library_name: transformers
tags:
- token-classification
- named-entity-recognition
- german
- xlm-roberta
- peft
- lora
---

# 🇩🇪 GermaNER: Adapter-Based NER for German using XLM-RoBERTa

<center><img src="assets/ner_logo.png" alt="NER Logo" width="200" style="margin-bottom:-90px;"/></center>

## 🔍 Overview

**GermaNER** is a high-performance Named Entity Recognition (NER) model built on top of `xlm-roberta-large` and fine-tuned with the [PEFT](https://github.com/huggingface/peft) framework using **LoRA adapters**. It supports 7 entity classes in the BIO tagging scheme and is optimized for both in-domain and general-domain German text.

> This model is lightweight (adapter-only): for inference, the LoRA adapter must be attached to the base model.

---

## 🧠 Architecture

- **Base model**: [`xlm-roberta-large`](https://huggingface.co/xlm-roberta-large)
- **Fine-tuning**: Parameter-Efficient Fine-Tuning (PEFT) using [LoRA](https://arxiv.org/abs/2106.09685)
- **Adapter config** (see the sketch after this list):
  - `r=16`, `alpha=32`, `dropout=0.1`
  - LoRA applied to the `query`, `key`, and `value` projection layers
- **Max sequence length**: 128 tokens
- **Mixed-precision training**: ✅ (fp16)
- **Training samples**: 44,000 sentences
- **Epochs**: 2
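
For illustration, a minimal `peft` configuration matching the settings above might look like this. It is a sketch, not the original training script; the `target_modules` names assume the standard XLM-RoBERTa attention-layer naming.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForTokenClassification

# Reconstruction of the adapter settings listed above (assumed, not the original script).
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,              # token-classification head
    r=16,                                      # LoRA rank
    lora_alpha=32,                             # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],  # attention projections
)

base = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large", num_labels=7)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```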

---

## 🏷️ Label Schema

The model uses the standard BIO format with the following 7 labels:

| Label   | Description                    |
|---------|--------------------------------|
| `O`     | Outside any named entity       |
| `B-PER` | Beginning of a person entity   |
| `I-PER` | Inside a person entity         |
| `B-ORG` | Beginning of an organization   |
| `I-ORG` | Inside an organization         |
| `B-LOC` | Beginning of a location entity |
| `I-LOC` | Inside a location entity       |
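
For example, the sentence "Angela Merkel besuchte Berlin." decomposes into BIO tags as follows (a hand-labelled illustration, not model output):

```python
tokens = ["Angela", "Merkel", "besuchte", "Berlin", "."]
tags   = ["B-PER",  "I-PER",  "O",        "B-LOC",  "O"]
```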

### 🗂️ Training-Set Concatenation

The model was trained on a **concatenated corpus** of GermEval 2014 and WikiANN-de (a loading sketch follows below):

| Split          | Sentences |
|----------------|-----------|
| **Training**   | 44,000    |
| **Evaluation** | 15,100    |

The datasets were token-aligned to the BIO scheme and merged before shuffling, ensuring a balanced distribution of domain-specific (news and Wikipedia) entity mentions across both splits.
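
A minimal sketch of this concatenation with 🤗 `datasets`. The Hub ids and the label-alignment step are assumptions; the two corpora ship with different label sets, so both must be mapped onto the shared 7-label schema before concatenation:

```python
from datasets import load_dataset, concatenate_datasets

# Hub ids are assumed; substitute the copies you actually use.
germeval = load_dataset("germeval_14", split="train")
wikiann = load_dataset("wikiann", "de", split="train")

# ...map both corpora onto the 7-label BIO schema here, so their
# features match (concatenate_datasets requires identical schemas)...

merged = concatenate_datasets([germeval, wikiann]).shuffle(seed=42)
print(len(merged))
```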

## 🚀 Getting Started

This model uses **adapter-based inference**, not a full checkpoint: use `peft` to attach the adapter weights to the base model.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from peft import PeftModel, PeftConfig

model_id = "zamal/GermaNER"

# Define label mappings
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: idx for idx, label in enumerate(label_names)}
id2label = {idx: label for idx, label in enumerate(label_names)}

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)

# Load PEFT adapter config
peft_config = PeftConfig.from_pretrained(model_id, token=True)

# Load base model with label mappings
base_model = AutoModelForTokenClassification.from_pretrained(
    peft_config.base_model_name_or_path,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
    token=True,
)

# Attach adapter
model = PeftModel.from_pretrained(base_model, model_id, token=True)

# Create pipeline
ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Run inference
text = "Angela Merkel war Bundeskanzlerin von Deutschland."
entities = ner_pipe(text)

for ent in entities:
    print(f"{ent['word']} → {ent['entity_group']} (score: {ent['score']:.2f})")
```
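
If the adapter-wrapped model is inconvenient to deploy, `peft` can also fold the LoRA weights into the base model so the result loads with plain `transformers`. A minimal sketch (the output directory name is arbitrary):

```python
# Merge the adapter into the base weights and save a standalone model.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("germaner-merged")
tokenizer.save_pretrained("germaner-merged")
```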

## 📁 Files & Structure

| File | Description |
|------|-------------|
| `adapter_model.safetensors` | LoRA adapter weights |
| `adapter_config.json` | PEFT config for the adapter |
| `tokenizer.json` | Tokenizer for XLM-RoBERTa |
| `sentencepiece.bpe.model` | SentencePiece model file |
| `special_tokens_map.json` | Special tokens config |
| `tokenizer_config.json` | Tokenizer settings |

## 💡 Open-Source Use Cases (Hugging Face)

- **Streaming news pipelines** – Deploy `transformers` NER via the `pipeline("ner")` API inside a Kafka → Faust stream processor, emit annotated JSON to OpenSearch/Elastic, and visualise it in Kibana dashboards, all built from OSS components.

- **Parliament analytics** – Load Bundestag & Länder transcripts with `datasets.load_dataset`, tag entities in batch with a `TokenClassificationPipeline` (see the batching sketch after this list), then export triples to Neo4j via the OSS `graphdatascience` driver and expose them through a GraphQL layer.

- **Biomedical text mining** – Ingest open German clinical-trial registries (e.g. from the Hugging Face Hub) into Spark, call the NER model on RDD partitions to extract drug, gene, and disease mentions, and feed a downstream pharmacovigilance workflow built entirely from Apache-licensed libraries.

- **Conversational AI** – Attach the LoRA adapter with `PeftModel` and serve it behind a lightweight HTTP inference service, then connect it to Rasa 3 (open source) through a custom NLU component for real-time slot-filling and context hand-off in German customer-support chatbots.
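
A minimal sketch of that batch-tagging step, reusing the `ner_pipe` from **Getting Started**; the dataset id and column name are hypothetical:

```python
from datasets import load_dataset
from transformers.pipelines.pt_utils import KeyDataset

# Hypothetical corpus of transcripts with a "text" column.
speeches = load_dataset("my-org/bundestag-speeches", split="train")

# Feeding a KeyDataset lets the pipeline batch inputs efficiently.
for entities in ner_pipe(KeyDataset(speeches, "text"), batch_size=32):
    print(entities)
```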

## 📜 License

This model is licensed under the Apache 2.0 License.

For questions, reach out on GitHub or Hugging Face 🤝

---

Open-source contributions are welcome, for example:

- A `demo.ipynb` notebook
- An evaluation script using `seqeval` (a starting point is sketched below)
- A `gr.Interface` or Streamlit demo for public inference
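
For the evaluation script, a minimal `seqeval` starting point (the tag sequences here are illustrative placeholders):

```python
from seqeval.metrics import classification_report, f1_score

# One list of BIO tags per sentence: gold labels vs. model predictions.
y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"]]

print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```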