---
language:
- es
license: apache-2.0
library_name: transformers
base_model: dccuchile/bert-base-spanish-wwm-cased
pipeline_tag: token-classification
tags:
- ner
- token-classification
- spanish
- bert
- emergencies
- ecu-911
datasets:
- custom-ecu911
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: ner_model_bert_base
results:
- task:
type: token-classification
name: Named Entity Recognition
dataset:
name: custom-ecu911
type: custom
split: test
metrics:
- type: accuracy
value: 0.9739766081871345
- type: f1
name: Macro F1
value: 0.8898766824816503
- type: precision
name: Macro Precision
value: 0.8801934151701145
- type: recall
name: Macro Recall
value: 0.9001920589792443
---
# NER for Spanish Emergency Reports (ECU-911)
**Author/Maintainer:** Danny Paltin ([@dannyLeo16](https://huggingface.co/dannyLeo16))
**Task:** Token Classification (NER)
**Language:** Spanish (es)
**Finetuned from:** `dccuchile/bert-base-spanish-wwm-cased`
**Entities (BIO):** `PER` and `LOC` — label set `["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]`
This model is a Spanish BERT fine-tuned to identify **persons** and **locations** in short emergency incident descriptions (ECU-911-style). It was developed for the research project:
> **“Representación del conocimiento para emergencias del ECU-911 mediante PLN, ontologías OWL y reglas SWRL.”**
---
## Model Details
- **Architecture:** BERT (Whole Word Masking, cased)
- **Tokenizer:** `dccuchile/bert-base-spanish-wwm-cased`
- **Max length:** uses base tokenizer `model_max_length` (padding to max length)
- **Libraries:** 🤗 Transformers, 🤗 Datasets, PyTorch
- **Labels:** `O, B-PER, I-PER, B-LOC, I-LOC`
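The label list above maps to integer ids in the usual Transformers `id2label`/`label2id` convention. A minimal sketch (the exact index order is an assumption based on the list as written):

```python
# Label set as listed above; the index order is an assumption
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

print(id2label)  # {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-LOC', 4: 'I-LOC'}
```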
---
## Training Data
- **Source:** Custom Spanish emergency reports (Ecuador, ECU-911-style) with token-level BIO annotations.
- **Size:** **510** texts; **34,232** tokens (avg **67.12** tokens/text).
- **Entity counts (BIO spans):** **PER = 421**, **LOC = 1,643**.
- **Token-level label distribution:** `O=30,132`, `B-LOC=1,643`, `I-LOC=1,617`, `B-PER=421`, `I-PER=419`.
- **Splits:** 80% train / 10% validation / 10% test (random split performed at training time).
> **Privacy/Ethics.** Data should be anonymized and free of PII. Do not deploy on personal/live data without consent and compliance with local regulation.
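The 80/10/10 split can be reproduced with a simple shuffled index split. This is a sketch, not the exact training script; the seed and helper name are assumptions:

```python
import random

def split_80_10_10(examples, seed=42):
    # Random 80/10/10 split as described above; the seed value is an assumption
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    n_train = int(0.8 * len(examples))
    n_val = int(0.1 * len(examples))
    train = [examples[i] for i in idx[:n_train]]
    val = [examples[i] for i in idx[n_train:n_train + n_val]]
    test = [examples[i] for i in idx[n_train + n_val:]]
    return train, val, test

# With the 510 texts reported above: 408 train / 51 validation / 51 test
train, val, test = split_80_10_10(list(range(510)))
print(len(train), len(val), len(test))  # 408 51 51
```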
---
## Training Procedure
- **Objective:** Token classification (cross-entropy); continuation subwords ignored with `-100`.
- **Hyperparameters:**
- `learning_rate = 2e-5`
- `num_train_epochs = 3`
- `per_device_train_batch_size = 8`
- `per_device_eval_batch_size = 8`
- `weight_decay = 0.01`
- `evaluation_strategy = "epoch"`, `save_strategy = "epoch"`
  - `load_best_model_at_end = true` *(selected by `eval_loss`)*
- **Data collator:** `DataCollatorForTokenClassification` (padding to `max_length`)
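The `-100` masking of continuation subwords described above can be sketched as follows. The helper name and the example `word_ids` are illustrative; in practice `word_ids` comes from a fast tokenizer's `word_ids()` method:

```python
def align_labels_with_tokens(labels, word_ids):
    # Assign -100 to special tokens and continuation subwords so that
    # cross-entropy ignores them; only the first subword keeps the label.
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)             # special tokens ([CLS], [SEP], padding)
        elif word_id != previous_word:
            aligned.append(labels[word_id])  # first subword keeps the word-level label
        else:
            aligned.append(-100)             # continuation subword, ignored in the loss
        previous_word = word_id
    return aligned

# Illustrative example: "Av. Américas" split into subwords by the tokenizer
word_ids = [None, 0, 1, 1, None]   # [CLS], "Av.", "Amér", "##icas", [SEP]
labels = [3, 4]                    # B-LOC, I-LOC
print(align_labels_with_tokens(labels, word_ids))  # [-100, 3, 4, -100, -100]
```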
---
## Evaluation
**Validation (epoch 3):**
- Accuracy: **0.9480**
- Macro F1: **0.7998**
- Macro Precision: **0.7914**
- Macro Recall: **0.8118**
- Eval loss: **0.1458**
**Test:**
- Accuracy: **0.9740**
- Macro F1: **0.8899**
- Macro Precision: **0.8802**
- Macro Recall: **0.9002**
- Eval loss: **0.0834**
*(Computed with `sklearn.metrics`, excluding `-100` positions.)*
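A sketch of how such macro metrics can be computed with `sklearn.metrics` while dropping the `-100` positions; the function name and the toy arrays are illustrative, not the exact evaluation script:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(predictions, label_ids):
    # Flatten batch x sequence arrays and drop positions labeled -100
    # (special tokens and continuation subwords).
    preds = np.asarray(predictions).flatten()
    refs = np.asarray(label_ids).flatten()
    mask = refs != -100
    preds, refs = preds[mask], refs[mask]
    p, r, f1, _ = precision_recall_fscore_support(
        refs, preds, average="macro", zero_division=0
    )
    return {"accuracy": accuracy_score(refs, preds),
            "precision": p, "recall": r, "f1": f1}

# Toy sanity check with synthetic label ids: the -100 position is ignored
refs = np.array([[0, 1, -100, 3]])
preds = np.array([[0, 1, 2, 3]])
print(compute_metrics(preds, refs))
```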
---
## Intended Use
- NER over Spanish emergency/incident text (ECU-911-like).
- Downstream knowledge representation (OWL/SWRL).
- Academic research and prototyping.
### Limitations
- Domain-specific; performance may drop on other domains.
- Only `PER` and `LOC` entities.
- May struggle with colloquialisms, misspellings, or code-switching.
---
## How to use
```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dannyLeo16/ner_model_bert_base",
    tokenizer="dannyLeo16/ner_model_bert_base",
    aggregation_strategy="simple",
)

text = "Se reporta accidente en la Av. de las Américas con dos personas heridas."
print(ner(text))
```