---
language:
- de
library_name: tflite
tags:
- named-entity-recognition
- ner
- german
- tflite
- on-device
- mobile
- android
- ios
datasets:
- GermanEval/germeval_14
base_model: deepset/gelectra-large
pipeline_tag: token-classification
license: mit
---

# MobAnon NER Model

A German named entity recognition (NER) model for the [MobAnon](https://github.com/jurasoft/JURA-KI-Anonymer-Mobile) document anonymization app. It is fine-tuned from [deepset/gelectra-large](https://huggingface.co/deepset/gelectra-large) on [GermEval14](https://huggingface.co/datasets/GermanEval/germeval_14) and converted to TensorFlow Lite for on-device inference.

## Model Details

| Property | Value |
|----------|-------|
| Base model | deepset/gelectra-large |
| Training data | GermEval14 (German NER) |
| Format | TensorFlow Lite (float16 quantized) |
| Size | ~638 MB |
| Test F1 | ~87-89% |
| Max sequence length | 128 tokens |

## Entity Types

The model detects four semantic entity types using BIO tagging:

| Entity | Examples |
|--------|----------|
| **PERSON** | Max Mustermann, Dr. Schmidt |
| **ORGANIZATION** | Deutsche Bank, Bundesgerichtshof |
| **LOCATION** | Frankfurt, Deutschland, Berliner Str. |
| **MISC** | Events, dates, other named entities |

MobAnon supplements these with regex-based detection for structured entities (email, phone, IBAN, identifiers).

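To give a feel for what regex-based detection of structured entities can look like, here is a minimal sketch. The patterns, the `PATTERNS` table, and the `find_structured_entities` helper are illustrative assumptions, not MobAnon's actual expressions; the IBAN pattern only covers German IBANs.

```python
import re

# Illustrative patterns only (assumption): not the expressions MobAnon ships.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # German IBAN: "DE", 2 check digits, 18 more digits, optionally in blocks of 4.
    "IBAN": re.compile(r"\bDE\d{2}(?: ?\d{4}){4} ?\d{2}\b"),
}


def find_structured_entities(text):
    """Return (label, start, end, matched_text) tuples sorted by start offset."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])
```

The returned character offsets can be merged with the spans found by the NER model, so that both kinds of entities are masked in a single pass over the text.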
## Usage

The MobAnon app downloads this model automatically on first use; no manual setup is required.

### Direct download

```bash
# Via huggingface-cli
huggingface-cli download PaulCamacho/mobanon-models deepseek.tflite

# Via URL
wget https://huggingface.co/PaulCamacho/mobanon-models/resolve/main/deepseek.tflite
```

### Input/Output Specification

| Tensor | Shape | Type | Description |
|--------|-------|------|-------------|
| `input_ids` | [1, 128] | int32 | Tokenized input IDs |
| `attention_mask` | [1, 128] | int32 | Attention mask |
| `logits` | [1, 128, 9] | float32 | Per-token logits for 9 BIO labels |

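Because the input shapes are fixed at `[1, 128]`, every sequence has to be padded or truncated before inference. The sketch below shows one way to do that in plain Python. It assumes the token IDs already include the tokenizer's special tokens, that the padding ID is `0`, and that the last ID is a closing special token worth keeping on truncation; `build_inputs` is a hypothetical helper, not part of MobAnon.

```python
MAX_LEN = 128  # fixed sequence length from the specification table
PAD_ID = 0     # assumption: padding token ID of the tokenizer vocabulary


def build_inputs(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Pad or truncate token IDs to max_len and build the matching attention mask.

    Returns (input_ids, attention_mask), each a nested list of shape [1, max_len].
    """
    ids = list(token_ids)
    if len(ids) > max_len:
        # Keep the final ID (assumed to be the closing special token).
        ids = ids[: max_len - 1] + ids[-1:]
    real = len(ids)
    padding = max_len - real
    input_ids = ids + [pad_id] * padding
    attention_mask = [1] * real + [0] * padding  # 1 = real token, 0 = padding
    return [input_ids], [attention_mask]
```

Both lists would still need to be converted to int32 arrays (for example with `numpy.asarray(..., dtype=numpy.int32)`) before being passed to a TensorFlow Lite interpreter.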
### Labels

| Index | Label | Entity |
|-------|-------|--------|
| 0 | O | Outside |
| 1 | B-PER | Begin Person |
| 2 | I-PER | Inside Person |
| 3 | B-ORG | Begin Organization |
| 4 | I-ORG | Inside Organization |
| 5 | B-LOC | Begin Location |
| 6 | I-LOC | Inside Location |
| 7 | B-MISC | Begin Miscellaneous |
| 8 | I-MISC | Inside Miscellaneous |

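The label table can be applied to the `logits` output roughly as follows. This is a sketch under stated assumptions, not MobAnon's decoding code: `decode_labels` takes the `[128, 9]` slice for one sequence (that is, `logits[0]`) together with its attention mask, and assumes padding follows all real tokens. Both helper names are hypothetical.

```python
# Index-to-label mapping, copied from the table above.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode_labels(logits, attention_mask):
    """Argmax each unmasked token's 9 logits and map the index to its label."""
    labels = []
    for row, keep in zip(logits, attention_mask):
        if not keep:
            break  # assumes padding follows all real tokens
        labels.append(LABELS[max(range(len(row)), key=row.__getitem__)])
    return labels


def group_entities(labels):
    """Merge B-/I- runs into (entity_type, start, end) spans; end is exclusive."""
    spans, current = [], None
    for i, label in enumerate(labels):
        prefix, _, etype = label.partition("-")
        if prefix == "I" and current and current[0] == etype:
            current[2] = i + 1  # extend the open span
            continue
        if current:
            spans.append(tuple(current))
            current = None
        if prefix in ("B", "I"):  # a stray I- tag opens a new span
            current = [etype, i, i + 1]
    if current:
        spans.append(tuple(current))
    return spans
```

The span indices refer to token positions, so mapping them back to character offsets in the original text still requires the tokenizer's offset information.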
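## Training

The model is fine-tuned, exported to ONNX, and converted to TensorFlow Lite with the scripts in `base_model`:

```bash
cd base_model
# Fine-tune the base model with mixed precision
python train_ner.py --epochs 3 --batch-size 16 --fp16
# Export the fine-tuned model to ONNX with static input shapes
python export_to_onnx.py --static-shapes
# Convert to TensorFlow Lite with float16 quantization
python convert_to_tflite.py --quantize float16
```

See the [base_model README](https://github.com/jurasoft/JURA-KI-Anonymer-Mobile/tree/main/base_model) for the full training and conversion pipeline.
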
## License

MIT