---
language:
- es
license: apache-2.0
library_name: transformers
base_model: dccuchile/bert-base-spanish-wwm-cased
pipeline_tag: token-classification
tags:
- ner
- token-classification
- spanish
- bert
- emergencies
- ecu-911
datasets:
- custom-ecu911
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: ner_model_bert_base
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: custom-ecu911
      type: custom
      split: test
    metrics:
    - type: accuracy
      value: 0.9739766081871345
    - type: f1
      name: Macro F1
      value: 0.8898766824816503
    - type: precision
      name: Macro Precision
      value: 0.8801934151701145
    - type: recall
      name: Macro Recall
      value: 0.9001920589792443
---

# NER for Spanish Emergency Reports (ECU-911)

**Author/Maintainer:** Danny Paltin ([@dannyLeo16](https://huggingface.co/dannyLeo16))  
**Task:** Token Classification (NER)  
**Language:** Spanish (es)  
**Fine-tuned from:** `dccuchile/bert-base-spanish-wwm-cased`  
**Entities (BIO):** `PER` and `LOC` → `["O","B-PER","I-PER","B-LOC","I-LOC"]`

This model is a Spanish BERT fine-tuned to identify **persons** and **locations** in short emergency incident descriptions (ECU-911-style). It was developed for the research project:

> **“Representación del conocimiento para emergencias del ECU-911 mediante PLN, ontologías OWL y reglas SWRL.”** (Knowledge representation for ECU-911 emergencies using NLP, OWL ontologies, and SWRL rules.)

---

## Model Details

- **Architecture:** BERT (Whole Word Masking, cased)
- **Tokenizer:** `dccuchile/bert-base-spanish-wwm-cased`
- **Max length:** uses the base tokenizer's `model_max_length` (padding to max length)
- **Libraries:** 🤗 Transformers, 🤗 Datasets, PyTorch
- **Labels:** `O, B-PER, I-PER, B-LOC, I-LOC`

---

## Training Data

- **Source:** Custom Spanish emergency reports (Ecuador, ECU-911-style) with token-level BIO annotations.
- **Size:** **510** texts; **34,232** tokens (avg **67.12** tokens/text).
- **Entity counts (BIO spans):** **PER = 421**, **LOC = 1,643**.
- **Token-level label distribution:** `O=30,132`, `B-LOC=1,643`, `I-LOC=1,617`, `B-PER=421`, `I-PER=419`.
- **Splits:** 80% train / 10% validation / 10% test (random split at training time).

> **Privacy/Ethics.** Data should be anonymized and free of PII. Do not deploy on personal/live data without consent and compliance with local regulations.

---

## Training Procedure

- **Objective:** Token classification (cross-entropy); continuation subwords are ignored with the label `-100`.
- **Hyperparameters:**
  - `learning_rate = 2e-5`
  - `num_train_epochs = 3`
  - `per_device_train_batch_size = 8`
  - `per_device_eval_batch_size = 8`
  - `weight_decay = 0.01`
  - `evaluation_strategy = "epoch"`, `save_strategy = "epoch"`
  - `load_best_model_at_end = true` *(selected by `eval_loss`)*
- **Data collator:** `DataCollatorForTokenClassification` (padding to `max_length`)

---

## Evaluation

**Validation (epoch 3):**

- Accuracy: **0.9480**
- Macro F1: **0.7998**
- Macro Precision: **0.7914**
- Macro Recall: **0.8118**
- Eval loss: **0.1458**

**Test:**

- Accuracy: **0.9740**
- Macro F1: **0.8899**
- Macro Precision: **0.8802**
- Macro Recall: **0.9002**
- Eval loss: **0.0834**

*(Computed with `sklearn.metrics`, excluding `-100` positions.)*

---

## Intended Use

- NER over Spanish emergency/incident text (ECU-911-like).
- Downstream knowledge representation (OWL/SWRL).
- Academic research and prototyping.

### Limitations

- Domain-specific; performance may drop on other domains.
- Only `PER` and `LOC` entities.
- May struggle with colloquialisms, misspellings, or code-switching.

---

## How to use

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dannyLeo16/ner_model_bert_base",
    tokenizer="dannyLeo16/ner_model_bert_base",
    aggregation_strategy="simple",
)

text = "Se reporta accidente en la Av. de las Américas con dos personas heridas."
print(ner(text))
```
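
With `aggregation_strategy="simple"`, the pipeline merges per-token BIO tags into entity spans for you. If you run the model without a pipeline and get raw BIO tags back, you can group them yourself. The sketch below shows that grouping logic; the `tokens` and `tags` are illustrative inputs, not actual model output:

```python
def bio_to_spans(tokens, tags):
    """Group (token, BIO-tag) pairs into (entity_type, entity_text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag of the same type continues the open span.
            current[1].append(token)
        else:
            # "O" (or a stray I- tag) closes the open span.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

# Illustrative BIO output for a report mentioning a person and a location.
tokens = ["Se", "reporta", "a", "Juan", "Pérez", "en", "Av.", "de", "las", "Américas"]
tags   = ["O", "O", "O", "B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# → [('PER', 'Juan Pérez'), ('LOC', 'Av. de las Américas')]
```

Note that this operates on whole tokens; when decoding raw model logits you would first drop continuation subwords (the positions labeled `-100` during training), then apply this grouping.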