---
language:
- es
license: apache-2.0
library_name: transformers
base_model: dccuchile/bert-base-spanish-wwm-cased
pipeline_tag: token-classification
tags:
- ner
- token-classification
- spanish
- bert
- emergencies
- ecu-911
datasets:
- custom-ecu911
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: ner_model_bert_base
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: custom-ecu911
      type: custom
      split: test
    metrics:
    - type: accuracy
      value: 0.9739766081871345
    - type: f1
      name: Macro F1
      value: 0.8898766824816503
    - type: precision
      name: Macro Precision
      value: 0.8801934151701145
    - type: recall
      name: Macro Recall
      value: 0.9001920589792443
---
# NER for Spanish Emergency Reports (ECU-911)
**Author/Maintainer:** Danny Paltin ([@dannyLeo16](https://huggingface.co/dannyLeo16))
**Task:** Token Classification (NER)
**Language:** Spanish (es)
**Finetuned from:** `dccuchile/bert-base-spanish-wwm-cased`
**Entities (BIO):** `PER` and `LOC` → `["O","B-PER","I-PER","B-LOC","I-LOC"]`
This model is a Spanish BERT fine-tuned to identify **persons** and **locations** in short emergency incident descriptions (ECU-911-style). It was developed for the research project:
> **“Representación del conocimiento para emergencias del ECU-911 mediante PLN, ontologías OWL y reglas SWRL”** (Knowledge representation for ECU-911 emergencies using NLP, OWL ontologies, and SWRL rules).
---
## Model Details
- **Architecture:** BERT (Whole Word Masking, cased)
- **Tokenizer:** `dccuchile/bert-base-spanish-wwm-cased`
- **Max length:** uses base tokenizer `model_max_length` (padding to max length)
- **Libraries:** 🤗 Transformers, 🤗 Datasets, PyTorch
- **Labels:** `O, B-PER, I-PER, B-LOC, I-LOC`
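To illustrate the BIO scheme above, here is a minimal decoder that turns a tag sequence into `(type, start, end)` spans. The sample tags are invented for illustration and are not drawn from the ECU-911 dataset:

```python
# Minimal BIO-span decoder; the example tags below are invented,
# not taken from the ECU-911 dataset.
def bio_to_spans(tags):
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((tags[start][2:], start, i))
                start = None
        if tag.startswith("B-"):
            start = i
    return spans

tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tags))  # [('PER', 0, 2), ('LOC', 4, 6)]
```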
---
## Training Data
- **Source:** Custom Spanish emergency reports (Ecuador, ECU-911-style) with token-level BIO annotations.
- **Size:** **510** texts; **34,232** tokens (avg **67.12** tokens/text).
- **Entity counts (BIO spans):** **PER = 421**, **LOC = 1,643**.
- **Token-level label distribution:** `O=30,132`, `B-LOC=1,643`, `I-LOC=1,617`, `B-PER=421`, `I-PER=419`.
- **Splits:** 80% train / 10% validation / 10% test (random split performed during training).
> **Privacy/Ethics.** Data should be anonymized and free of PII. Do not deploy on personal/live data without consent and compliance with local regulation.
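The 80/10/10 random split can be sketched in plain Python (the card states only the ratios, not the actual splitting code, so the seed and helper below are assumptions):

```python
import random

# Sketch of the 80/10/10 random split described above; the seed and
# function are illustrative, not taken from the original training code.
def split_indices(n, seed=42):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(510)  # 408 / 51 / 51 examples
```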
---
## Training Procedure
- **Objective:** Token classification (cross-entropy); continuation subwords ignored with `-100`.
- **Hyperparameters:**
- `learning_rate = 2e-5`
- `num_train_epochs = 3`
- `per_device_train_batch_size = 8`
- `per_device_eval_batch_size = 8`
- `weight_decay = 0.01`
- `evaluation_strategy = "epoch"`, `save_strategy = "epoch"`
- `load_best_model_at_end = true` *(selected by `eval_loss`)*
- **Data collator:** `DataCollatorForTokenClassification` (padding to `max_length`)
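The `-100` masking mentioned above can be sketched as follows, assuming word-level BIO labels and the `word_ids()` mapping produced by a fast tokenizer. This mirrors, but does not reproduce, the actual training code:

```python
# Sketch of subword label alignment: the first subword of each word keeps
# the word's label; continuation subwords and special tokens get -100,
# so the cross-entropy loss ignores them.
def align_labels(word_labels, word_ids):
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:           # special tokens ([CLS], [SEP], padding)
            aligned.append(-100)
        elif wid != prev:         # first subword of a new word
            aligned.append(word_labels[wid])
        else:                     # continuation subword
            aligned.append(-100)
        prev = wid
    return aligned
```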
---
## Evaluation
**Validation (epoch 3):**
- Accuracy: **0.9480**
- Macro F1: **0.7998**
- Macro Precision: **0.7914**
- Macro Recall: **0.8118**
- Eval loss: **0.1458**
**Test:**
- Accuracy: **0.9740**
- Macro F1: **0.8899**
- Macro Precision: **0.8802**
- Macro Recall: **0.9002**
- Eval loss: **0.0834**
*(Computed with `sklearn.metrics`, excluding `-100` positions.)*
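A metric computation along the lines described above (flatten logits and references, drop `-100` positions, macro-average with `sklearn`) might look like this; the function name and exact shapes are assumptions, not the project's code:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Sketch of the evaluation described above: exclude -100 positions,
# then compute accuracy and macro-averaged precision/recall/F1.
def compute_metrics(logits, labels):
    preds = np.argmax(logits, axis=-1)
    mask = labels != -100
    y_true, y_pred = labels[mask], preds[mask]
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": p,
        "recall": r,
        "f1": f1,
    }
```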
---
## Intended Use
- NER over Spanish emergency/incident text (ECU-911-like).
- Downstream knowledge representation (OWL/SWRL).
- Academic research and prototyping.
### Limitations
- Domain-specific; performance may drop on other domains.
- Only `PER` and `LOC` entities.
- May struggle with colloquialisms, misspellings, or code-switching.
---
## How to use
```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dannyLeo16/ner_model_bert_base",
    tokenizer="dannyLeo16/ner_model_bert_base",
    aggregation_strategy="simple",
)

text = "Se reporta accidente en la Av. de las Américas con dos personas heridas."
ner(text)
```
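With `aggregation_strategy="simple"`, the pipeline returns one dictionary per aggregated entity. A small, hypothetical post-processing step to keep only confident `PER`/`LOC` spans might look like this (the score threshold is an example value, not one used in the original project):

```python
# Hypothetical filter over the pipeline output; the 0.80 threshold is
# an illustrative choice, not part of the original project.
def extract_entities(ner_output, min_score=0.80):
    return [
        {"text": e["word"], "type": e["entity_group"], "score": float(e["score"])}
        for e in ner_output
        if e["score"] >= min_score
    ]
```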