Commit 3766006 (verified) by zamal, parent b3c2211: Update README.md
---
license: apache-2.0
language: de
library_name: transformers
tags:
- token-classification
- named-entity-recognition
- german
- xlm-roberta
- peft
- lora
---

# 🇩🇪 GermaNER: Adapter-Based NER for German using XLM-RoBERTa

<center><img src="assets/ner_logo.png" alt="NER Logo" width="200" style="margin-bottom:-90px;"/></center>

## 🔍 Overview

**GermaNER** is a high-performance Named Entity Recognition (NER) model built on top of `xlm-roberta-large` and fine-tuned with the [PEFT](https://github.com/huggingface/peft) framework using **LoRA adapters**. It supports 7 entity classes in the BIO tagging scheme and is optimized for both in-domain and general-domain German text.

> This model is lightweight (adapter-only): for inference, attach the LoRA adapter to the base model.

---

## 🧠 Architecture

- **Base model**: [`xlm-roberta-large`](https://huggingface.co/xlm-roberta-large)
- **Fine-tuning**: Parameter-Efficient Fine-Tuning (PEFT) using [LoRA](https://arxiv.org/abs/2106.09685)
- **Adapter config**:
  - `r=16`, `alpha=32`, `dropout=0.1`
  - LoRA applied to the `query`, `key`, and `value` projection layers
- **Max sequence length**: 128 tokens
- **Mixed-precision training**: ✅ (fp16)
- **Training samples**: 44,000 sentences
- **Epochs**: 2
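The adapter-only footprint implied by this config can be estimated with a little arithmetic. A minimal sketch, assuming the published `xlm-roberta-large` dimensions (hidden size 1024, 24 transformer layers); the variable names are illustrative, not part of the model code:

```python
# Estimate trainable LoRA parameters for r=16 adapters on the
# query/key/value projections of xlm-roberta-large.
hidden_size = 1024   # xlm-roberta-large hidden dimension
num_layers = 24      # transformer layers in the base model
r = 16               # LoRA rank from the adapter config
projections = 3      # query, key, value

# Each adapted 1024x1024 projection gains two low-rank matrices:
# A (r x hidden) and B (hidden x r), i.e. 2 * r * hidden parameters.
params_per_projection = 2 * r * hidden_size
total_lora_params = params_per_projection * projections * num_layers

print(f"{total_lora_params:,} trainable LoRA parameters")  # 2,359,296
```

Roughly 2.4M trainable parameters, versus several hundred million in the frozen base model, which is why only the adapter weights need to be stored in this repository.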
 
---

## 🏷️ Label Schema

The model uses the standard BIO format with the following 7 labels:

| Label   | Description                    |
|---------|--------------------------------|
| `O`     | Outside any named entity       |
| `B-PER` | Beginning of a person entity   |
| `I-PER` | Inside a person entity         |
| `B-ORG` | Beginning of an organization   |
| `I-ORG` | Inside an organization         |
| `B-LOC` | Beginning of a location entity |
| `I-LOC` | Inside a location entity       |

### 🗂️ Training-Set Concatenation

The model was trained on a **concatenated corpus** of [GermEval 2014](https://www.kaggle.com/datasets/rtatman/germaneval2014-ner) and [WikiANN (de)](https://huggingface.co/datasets/wikiann):

| Split          | Sentences |
|----------------|-----------|
| **Training**   | 44,000    |
| **Evaluation** | 15,100    |

The datasets were token-aligned to the BIO scheme and merged before shuffling, ensuring a balanced distribution of domain-specific (news and Wikipedia) entity mentions across both splits.
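The alignment step can be sketched in plain Python. GermEval 2014 includes finer-grained tags (e.g. `B-OTH`, derivational variants such as `B-LOCderiv`) that do not exist in the 7-label schema; one plausible reduction, shown here as an assumption rather than the exact preprocessing used for this model, maps everything outside the shared schema to `O`:

```python
# The 7-label schema shared by both corpora after reduction.
KEPT = {"O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"}

def reduce_tags(tags):
    """Map any tag outside the shared 7-label schema to 'O'."""
    return [t if t in KEPT else "O" for t in tags]

# Toy sentences standing in for the two corpora (tags illustrative).
germeval = [
    (["VW", "baut", "in", "Sachsen"], ["B-ORG", "O", "O", "B-LOC"]),
    (["der", "deutsche", "Kanzler"], ["O", "B-LOCderiv", "O"]),  # deriv tag dropped
]
wikiann = [(["Angela", "Merkel"], ["B-PER", "I-PER"])]

# Reduce, then concatenate before shuffling/splitting.
merged = [(toks, reduce_tags(tags)) for toks, tags in germeval + wikiann]
print(merged[1][1])  # ['O', 'O', 'O']
```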

## 🚀 Getting Started

This repository contains adapter weights only, not a full model. Use `peft` to attach the adapter to the base model.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from peft import PeftModel, PeftConfig

model_id = "zamal/GermaNER"

# Define label mappings
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: idx for idx, label in enumerate(label_names)}
id2label = {idx: label for idx, label in enumerate(label_names)}

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the PEFT adapter config to find the base model
peft_config = PeftConfig.from_pretrained(model_id)

# Load the base model with label mappings
base_model = AutoModelForTokenClassification.from_pretrained(
    peft_config.base_model_name_or_path,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, model_id)

# Create the pipeline
ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Run inference
text = "Angela Merkel war Bundeskanzlerin von Deutschland."
entities = ner_pipe(text)

for ent in entities:
    print(f"{ent['word']} → {ent['entity_group']} (score: {ent['score']:.2f})")
```
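In practice you often want to post-process the pipeline output, for instance by dropping low-confidence predictions. A small sketch in plain Python, assuming the list-of-dicts shape that `pipeline("ner", ..., aggregation_strategy="simple")` returns (the threshold and helper name are illustrative):

```python
def filter_entities(entities, min_score=0.80):
    """Keep only aggregated NER predictions at or above min_score."""
    return [e for e in entities if e["score"] >= min_score]

# Example of the aggregated pipeline output shape (scores illustrative).
raw = [
    {"entity_group": "PER", "word": "Angela Merkel", "score": 0.99, "start": 0, "end": 13},
    {"entity_group": "LOC", "word": "Deutschland", "score": 0.42, "start": 43, "end": 54},
]
print([e["word"] for e in filter_entities(raw)])  # ['Angela Merkel']
```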

## 📁 Files & Structure

| File                        | Description                 |
|-----------------------------|-----------------------------|
| `adapter_model.safetensors` | LoRA adapter weights        |
| `adapter_config.json`       | PEFT config for the adapter |
| `tokenizer.json`            | Tokenizer for XLM-RoBERTa   |
| `sentencepiece.bpe.model`   | SentencePiece model file    |
| `special_tokens_map.json`   | Special tokens config       |
| `tokenizer_config.json`     | Tokenizer settings          |

## 💡 Open-Source Use Cases (Hugging Face)

- **Streaming news pipelines** – Deploy `transformers` NER via the `pipeline("ner")` API inside a Kafka → Faust stream processor. Emit annotated JSON to OpenSearch/Elastic and visualise it in Kibana dashboards, all built from OSS components.

- **Parliament analytics** – Load Bundestag and Länder transcripts with `datasets.load_dataset`, tag entities in batch with a `TokenClassificationPipeline`, then export triples to Neo4j via the OSS `graphdatascience` driver and expose them through a GraphQL layer.

- **Biomedical text mining** – Ingest open German clinical-trial registries (e.g. from the Hugging Face Hub) into Spark; call the NER model on RDD partitions to extract drug-gene-disease mentions, feeding a downstream pharmacovigilance workflow built entirely from Apache-licensed libraries.

- **Conversational AI** – Attach the LoRA adapter with `PeftModel` and serve it behind an open-source inference server; connect it to Rasa 3 (open source) for real-time slot filling and context hand-off in German customer-support chatbots.

## 📜 License

This model is licensed under the Apache 2.0 License.

For questions, reach out on GitHub or Hugging Face 🤝

---

Open-source contributions are welcome via:
- A `demo.ipynb` notebook
- An evaluation script using `seqeval`
- A `gr.Interface` or Streamlit demo for public inference