---
language:
- es
license: apache-2.0
library_name: transformers
base_model: dccuchile/bert-base-spanish-wwm-cased
pipeline_tag: token-classification
tags:
- ner
- token-classification
- spanish
- bert
- emergencies
- ecu-911
datasets:
- custom-ecu911
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: ner_model_bert_base
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: custom-ecu911
      type: custom
      split: test
    metrics:
    - type: accuracy
      value: 0.9739766081871345
    - type: f1
      name: Macro F1
      value: 0.8898766824816503
    - type: precision
      name: Macro Precision
      value: 0.8801934151701145
    - type: recall
      name: Macro Recall
      value: 0.9001920589792443
---

# NER for Spanish Emergency Reports (ECU-911)

**Author/Maintainer:** Danny Paltin ([@dannyLeo16](https://huggingface.co/dannyLeo16))  
**Task:** Token Classification (NER)  
**Language:** Spanish (es)  
**Finetuned from:** `dccuchile/bert-base-spanish-wwm-cased`  
**Entities (BIO):** `PER` and `LOC` (label set: `["O","B-PER","I-PER","B-LOC","I-LOC"]`)

This model is a Spanish BERT fine-tuned to identify **persons** and **locations** in short emergency incident descriptions (ECU-911-style). It was developed for the research project:

> **“Representación del conocimiento para emergencias del ECU-911 mediante PLN, ontologías OWL y reglas SWRL.”**

---

## Model Details
- **Architecture:** BERT (Whole Word Masking, cased)
- **Tokenizer:** `dccuchile/bert-base-spanish-wwm-cased`
- **Max length:** uses base tokenizer `model_max_length` (padding to max length)
- **Libraries:** 🤗 Transformers, 🤗 Datasets, PyTorch
- **Labels:** `O, B-PER, I-PER, B-LOC, I-LOC`
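For convenience, the label set above can be expressed as an id-to-label mapping. The ordering below is an assumption based on the order the labels are listed in this card; verify it against the model's `config.json` on the Hub.

```python
# Assumed id-to-label mapping, following the label order listed above.
# Check against the uploaded config.json before relying on these ids.
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-LOC", 4: "I-LOC"}
label2id = {label: idx for idx, label in id2label.items()}
```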

---

## Training Data
- **Source:** Custom Spanish emergency reports (Ecuador, ECU-911-style) with token-level BIO annotations.
- **Size:** **510** texts; **34,232** tokens (avg **67.12** tokens/text).
- **Entity counts (BIO spans):** **PER = 421**, **LOC = 1,643**.
- **Token-level label distribution:** `O=30,132`, `B-LOC=1,643`, `I-LOC=1,617`, `B-PER=421`, `I-PER=419`.
- **Splits:** 80% train / 10% validation / 10% test (random split at training time).
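An 80/10/10 random split over the 510 texts can be sketched in plain Python (the seed below is illustrative; the card only states that the split was random at training time):

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle and split into 80% train / 10% validation / 10% test.
    The seed is illustrative, not the one used for the published model."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n = len(examples)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = [examples[i] for i in idx[:n_train]]
    val = [examples[i] for i in idx[n_train:n_train + n_val]]
    test = [examples[i] for i in idx[n_train + n_val:]]
    return train, val, test

train, val, test = split_80_10_10(list(range(510)))
print(len(train), len(val), len(test))  # 408 51 51
```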

> **Privacy/Ethics.** Data should be anonymized and free of PII. Do not deploy on personal/live data without consent and compliance with local regulation.

---

## Training Procedure
- **Objective:** Token classification (cross-entropy); continuation subwords ignored with `-100`.
- **Hyperparameters:**
  - `learning_rate = 2e-5`
  - `num_train_epochs = 3`
  - `per_device_train_batch_size = 8`
  - `per_device_eval_batch_size = 8`
  - `weight_decay = 0.01`
  - `evaluation_strategy = "epoch"`, `save_strategy = "epoch"`
  - `load_best_model_at_end = true` *(best model selected by `eval_loss`)*
- **Data collator:** `DataCollatorForTokenClassification` (padding to `max_length`)
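The `-100` masking of continuation subwords can be sketched as a small helper over the `word_ids()` sequence that 🤗 tokenizers return for pre-split input. The `word_ids` and label ids in the example are illustrative (assuming `0 = O`, `3 = B-LOC`, `4 = I-LOC`):

```python
def align_labels_with_tokens(word_ids, word_labels, ignore_index=-100):
    """Map word-level BIO label ids onto subword tokens.

    Special tokens (word_id None) and continuation subwords receive
    ignore_index, matching the -100 positions skipped by PyTorch
    cross-entropy during training.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)      # [CLS], [SEP], padding
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword keeps the label
        else:
            aligned.append(ignore_index)      # continuation subword: ignored
        previous = wid
    return aligned

# Illustrative word_ids, as returned by
# tokenizer(words, is_split_into_words=True).word_ids()
word_ids = [None, 0, 1, 2, 2, 3, None]
labels = [0, 0, 3, 4]  # O O B-LOC I-LOC
print(align_labels_with_tokens(word_ids, labels))
# -> [-100, 0, 0, 3, -100, 4, -100]
```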


---

## Evaluation
**Validation (epoch 3):**  
- Accuracy: **0.9480**  
- Macro F1: **0.7998**  
- Macro Precision: **0.7914**  
- Macro Recall: **0.8118**  
- Eval loss: **0.1458**

**Test:**  
- Accuracy: **0.9740**  
- Macro F1: **0.8899**  
- Macro Precision: **0.8802**  
- Macro Recall: **0.9002**  
- Eval loss: **0.0834**

*(Computed with `sklearn.metrics`, excluding `-100` positions.)*
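A minimal sketch of that computation, assuming integer label ids and logits of shape `(batch, seq_len, num_labels)`; the function name is hypothetical:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_token_metrics(logits, labels, ignore_index=-100):
    """Flatten predictions and labels, drop ignored (-100) positions,
    then compute accuracy and macro-averaged precision/recall/F1."""
    preds = np.argmax(logits, axis=-1).ravel()
    labels = np.asarray(labels).ravel()
    mask = labels != ignore_index
    preds, labels = preds[mask], labels[mask]
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_precision": precision,
        "macro_recall": recall,
        "macro_f1": f1,
    }
```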

---

## Intended Use
- NER over Spanish emergency/incident text (ECU-911-like).
- Downstream knowledge representation (OWL/SWRL).
- Academic research and prototyping.

### Limitations
- Domain-specific; performance may drop on other domains.
- Only `PER` and `LOC` entities.
- May struggle with colloquialisms, misspellings, or code-switching.

---

## How to use

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dannyLeo16/ner_model_bert_base",
    tokenizer="dannyLeo16/ner_model_bert_base",
    aggregation_strategy="simple"
)
text = "Se reporta accidente en la Av. de las Américas con dos personas heridas."
print(ner(text))
```