---
language:
- es
license: apache-2.0
library_name: transformers
base_model: dccuchile/bert-base-spanish-wwm-cased
pipeline_tag: token-classification
tags:
- ner
- token-classification
- spanish
- bert
- emergencies
- ecu-911
datasets:
- custom-ecu911
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: ner_model_bert_base
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: custom-ecu911
      type: custom
      split: test
    metrics:
    - type: accuracy
      value: 0.9739766081871345
    - type: f1
      name: Macro F1
      value: 0.8898766824816503
    - type: precision
      name: Macro Precision
      value: 0.8801934151701145
    - type: recall
      name: Macro Recall
      value: 0.9001920589792443
---

# NER for Spanish Emergency Reports (ECU-911)

**Author/Maintainer:** Danny Paltin ([@dannyLeo16](https://huggingface.co/dannyLeo16))

**Task:** Token Classification (NER)

**Language:** Spanish (es)

**Finetuned from:** `dccuchile/bert-base-spanish-wwm-cased`

**Entities (BIO):** `PER` and `LOC` → `["O","B-PER","I-PER","B-LOC","I-LOC"]`

This model is a Spanish BERT fine-tuned to identify **persons** and **locations** in short emergency incident descriptions (ECU-911-style). It was developed for the research project:

> **“Representación del conocimiento para emergencias del ECU-911 mediante PLN, ontologías OWL y reglas SWRL.”**

---

## Model Details

- **Architecture:** BERT (Whole Word Masking, cased)
- **Tokenizer:** `dccuchile/bert-base-spanish-wwm-cased`
- **Max length:** the base tokenizer's `model_max_length`; inputs are padded to that length
- **Libraries:** 🤗 Transformers, 🤗 Datasets, PyTorch
- **Labels:** `O, B-PER, I-PER, B-LOC, I-LOC`
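
To make the BIO scheme concrete, here is a minimal sketch that builds the label maps from the list above and turns a tag sequence into entity spans. The helper name `bio_to_spans` and the example tags are hypothetical and hand-written for illustration; they are not output from this model, and the index order of the labels is assumed to follow the list as written.

```python
# Label order assumed to match the list in this card.
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}


def bio_to_spans(tags):
    """Convert BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, etype = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


# Hand-written example tags (not model output).
words = ["Se", "reporta", "accidente", "en", "la", "Av.", "de", "las", "Américas"]
tags = ["O", "O", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "I-LOC"]

for etype, start, end in bio_to_spans(tags):
    print(etype, " ".join(words[start:end]))
```

In this sketch an `I-` tag with no matching open entity is simply skipped; a stricter decoder might instead treat it as the start of a new entity.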

---

## Training Data

- **Source:** Custom Spanish emergency reports (Ecuador, ECU-911-style) with token-level BIO annotations.
- **Size:** **510** texts; **34,232** tokens (avg **67.12** tokens/text).
- **Entity counts (BIO spans):** **PER = 421**, **LOC = 1,643**.
- **Token-level label distribution:** `O=30,132`, `B-LOC=1,643`, `I-LOC=1,617`, `B-PER=421`, `I-PER=419`.
- **Splits:** 80% train / 10% validation / 10% test (random split performed at training time).
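
The figures above can be cross-checked with a few lines of arithmetic. This is a sketch only: the label counts and text count are taken from this card, while the rounding rule for the 80/10/10 split (10% floored for validation and test, remainder to train) is an assumption, since the card does not state how the split sizes were rounded.

```python
# Token-level label counts as reported in this card.
label_counts = {"O": 30132, "B-LOC": 1643, "I-LOC": 1617, "B-PER": 421, "I-PER": 419}
n_texts = 510

total_tokens = sum(label_counts.values())
avg_tokens = total_tokens / n_texts
entity_share = 1 - label_counts["O"] / total_tokens

# Assumed rounding: validation and test each take floor(10%), train gets the rest.
n_val = n_test = n_texts // 10
n_train = n_texts - n_val - n_test

print(f"total tokens: {total_tokens}")          # 34232
print(f"avg tokens/text: {avg_tokens:.2f}")     # 67.12
print(f"entity token share: {entity_share:.1%}")
print(f"train/val/test texts: {n_train}/{n_val}/{n_test}")  # 408/51/51
```

Because the large majority of tokens are labeled `O`, token accuracy is dominated by that class, which is why the macro-averaged scores are the more informative figures for `PER` and `LOC`.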

> **Privacy/Ethics.** Data should be anonymized and free of PII. Do not deploy on personal/live data without consent and compliance with local regulation.

---

## Training Procedure

- **Objective:** Token classification (cross-entropy); continuation subwords ignored with `-100`.
- **Hyperparameters:**
  - `learning_rate = 2e-5`
  - `num_train_epochs = 3`
  - `per_device_train_batch_size = 8`
  - `per_device_eval_batch_size = 8`
  - `weight_decay = 0.01`
  - `evaluation_strategy = "epoch"`, `save_strategy = "epoch"`
  - `load_best_model_at_end = True` *(selected by `eval_loss`)*
- **Data collator:** `DataCollatorForTokenClassification` (padding to `max_length`)
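
The `-100` convention in the objective above can be sketched as follows. This is an illustrative helper, not the author's preprocessing code: `align_labels` is a hypothetical name, and the `word_ids` list stands in for what a fast tokenizer's `word_ids()` method returns (`None` for special and padding tokens, otherwise the index of the source word).

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss


def align_labels(word_ids, word_labels):
    """Label only the first subword of each word; mask everything else."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens ([CLS], [SEP]) and padding carry no label.
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First subword of a word keeps the word's label id.
            aligned.append(word_labels[wid])
        else:
            # Continuation subword of the same word is ignored.
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned


# Hypothetical example: three words, the second split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 3, 4]  # O, B-LOC, I-LOC
print(align_labels(word_ids, word_labels))  # [-100, 0, 3, -100, 4, -100]
```

Masking continuation subwords means each word contributes exactly one term to the loss, regardless of how many pieces the tokenizer splits it into.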

---

## Evaluation

**Validation (epoch 3):**

- Accuracy: **0.9480**
- Macro F1: **0.7998**
- Macro Precision: **0.7914**
- Macro Recall: **0.8118**
- Eval loss: **0.1458**

**Test:**

- Accuracy: **0.9740**
- Macro F1: **0.8899**
- Macro Precision: **0.8802**
- Macro Recall: **0.9002**
- Eval loss: **0.0834**

*(Computed with `sklearn.metrics`, excluding `-100` positions.)*
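
For readers who want to see what "excluding `-100` positions" and macro averaging amount to, here is a dependency-free sketch. It is not the author's evaluation script: the function names `flatten_valid` and `macro_prf` and the example ids are hypothetical. With scikit-learn installed, `sklearn.metrics.accuracy_score` and `precision_recall_fscore_support(y_true, y_pred, average="macro")` should give the same kind of result on the flattened lists.

```python
def flatten_valid(pred_ids, label_ids, ignore_index=-100):
    """Drop masked positions and flatten batches into two parallel lists."""
    y_true, y_pred = [], []
    for preds, labels in zip(pred_ids, label_ids):
        for pred, label in zip(preds, labels):
            if label != ignore_index:
                y_true.append(label)
                y_pred.append(pred)
    return y_true, y_pred


def macro_prf(y_true, y_pred):
    """Unweighted mean of per-class precision, recall and F1 (0 when undefined)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical batch of one sequence; -100 marks special tokens and continuations.
label_ids = [[-100, 0, 3, -100, 4, -100]]
pred_ids = [[0, 0, 3, 3, 0, 0]]

y_true, y_pred = flatten_valid(pred_ids, label_ids)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = macro_prf(y_true, y_pred)
print(y_true, y_pred)  # [0, 3, 4] [0, 3, 0]
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

Macro averaging weights every label equally, so a rare class such as `I-PER` counts as much as `O`.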

---

## Intended Use

- NER over Spanish emergency/incident text (ECU-911-like).
- Downstream knowledge representation (OWL/SWRL).
- Academic research and prototyping.

### Limitations

- Domain-specific; performance may drop on other domains.
- Only `PER` and `LOC` entities.
- May struggle with colloquialisms, misspellings, or code-switching.

---

## How to use

```python
from transformers import pipeline

# Load the fine-tuned model; "simple" aggregation merges subword pieces
# into whole entity spans.
ner = pipeline(
    "token-classification",
    model="dannyLeo16/ner_model_bert_base",
    tokenizer="dannyLeo16/ner_model_bert_base",
    aggregation_strategy="simple",
)

text = "Se reporta accidente en la Av. de las Américas con dos personas heridas."

# Each result is a dict with `entity_group`, `word`, `score`, `start`, `end`.
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```