---
language:
  - es
license: apache-2.0
library_name: transformers
base_model: dccuchile/bert-base-spanish-wwm-cased
pipeline_tag: token-classification
tags:
  - ner
  - token-classification
  - spanish
  - bert
  - emergencies
  - ecu-911
datasets:
  - custom-ecu911
metrics:
  - accuracy
  - f1
  - precision
  - recall
model-index:
  - name: ner_model_bert_base
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          name: custom-ecu911
          type: custom
          split: test
        metrics:
          - type: accuracy
            value: 0.9739766081871345
          - type: f1
            name: Macro F1
            value: 0.8898766824816503
          - type: precision
            name: Macro Precision
            value: 0.8801934151701145
          - type: recall
            name: Macro Recall
            value: 0.9001920589792443
---

# NER for Spanish Emergency Reports (ECU-911)

Author/Maintainer: Danny Paltin (@dannyLeo16)
Task: Token Classification (NER)
Language: Spanish (es)
Finetuned from: dccuchile/bert-base-spanish-wwm-cased
Entities (BIO): PER and LOC (label set: ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"])

This model is a Spanish BERT fine-tuned to identify persons and locations in short emergency incident descriptions (ECU-911-style). It was developed for the research project:

“Representación del conocimiento para emergencias del ECU-911 mediante PLN, ontologías OWL y reglas SWRL” (Knowledge representation for ECU-911 emergencies using NLP, OWL ontologies, and SWRL rules).


## Model Details

- Architecture: BERT (Whole Word Masking, cased)
- Tokenizer: dccuchile/bert-base-spanish-wwm-cased
- Max length: uses the base tokenizer's model_max_length (padding to max length)
- Libraries: 🤗 Transformers, 🤗 Datasets, PyTorch
- Labels: O, B-PER, I-PER, B-LOC, I-LOC
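The five labels above correspond to integer class ids. A minimal sketch of the id-to-label mappings, assuming the ids follow the order of the list above (the authoritative mapping is the one in the model's config):

```python
# Label set as listed above; the id ordering is an assumption and should
# be verified against the model's config.json (id2label / label2id).
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

id2label = {i: lab for i, lab in enumerate(labels)}
label2id = {lab: i for i, lab in enumerate(labels)}
```

Under this assumed ordering, `id2label[1]` is `"B-PER"` and `label2id["I-LOC"]` is `4`.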

## Training Data

- Source: custom Spanish emergency reports (Ecuador, ECU-911-style) with token-level BIO annotations.
- Size: 510 texts; 34,232 tokens (avg 67.12 tokens/text).
- Entity counts (BIO spans): PER = 421, LOC = 1,643.
- Token-level label distribution: O = 30,132, B-LOC = 1,643, I-LOC = 1,617, B-PER = 421, I-PER = 419.
- Splits: 80% train / 10% validation / 10% test (random split performed at training time).
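In the BIO scheme used above, the first token of an entity span is tagged `B-` and continuation tokens `I-`. A hypothetical annotated report (invented for illustration, not drawn from the dataset) and a small helper that recovers entity spans from the tags:

```python
# Hypothetical BIO-annotated example (not an actual dataset record).
tokens = ["Juan", "Pérez", "reporta", "un", "incendio", "en", "la",
          "Av.", "de", "las", "Américas"]
tags   = ["B-PER", "I-PER", "O", "O", "O", "O", "O",
          "B-LOC", "I-LOC", "I-LOC", "I-LOC"]

def bio_spans(tokens, tags):
    """Collect (label, surface text) entity spans from BIO tags."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new span begins
            if current:
                spans.append(current)
            current = [tag[2:], [tok]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)        # continuation of the open span
        else:                             # "O" (or a stray I-) closes the span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]
```

Applied to the example above, `bio_spans` yields one PER span ("Juan Pérez") and one LOC span ("Av. de las Américas").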

Privacy/Ethics. Data should be anonymized and free of PII. Do not deploy on personal/live data without consent and compliance with local regulation.


## Training Procedure

- Objective: token classification (cross-entropy); continuation subwords and special tokens are masked with label -100 so they do not contribute to the loss.
- Hyperparameters:
  - learning_rate = 2e-5
  - num_train_epochs = 3
  - per_device_train_batch_size = 8
  - per_device_eval_batch_size = 8
  - weight_decay = 0.01
  - evaluation_strategy = "epoch", save_strategy = "epoch"
  - load_best_model_at_end = true (best checkpoint selected by eval_loss)
- Data collator: DataCollatorForTokenClassification (padding to max_length)
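The -100 masking mentioned above aligns word-level BIO labels to subword positions so that only each word's first subword carries a real label. A minimal sketch of that alignment, operating on a precomputed `word_ids` list of the kind a fast tokenizer returns (`None` for special tokens):

```python
# Align word-level labels to subword positions; continuation subwords
# and special tokens get -100, which cross-entropy ignores.
def align_labels(word_ids, word_labels, ignore_index=-100):
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:            # special token ([CLS], [SEP], padding)
            aligned.append(ignore_index)
        elif wid != previous:      # first subword of a word: keep its label
            aligned.append(word_labels[wid])
        else:                      # continuation subword: mask out
            aligned.append(ignore_index)
        previous = wid
    return aligned
```

For instance, with `word_ids = [None, 0, 0, 1, None]` (one word split into two subwords, then a one-subword word) and word labels `[3, 0]`, the aligned sequence is `[-100, 3, -100, 0, -100]`.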

## Evaluation

Validation (epoch 3):

- Accuracy: 0.9480
- Macro F1: 0.7998
- Macro Precision: 0.7914
- Macro Recall: 0.8118
- Eval loss: 0.1458

Test:

- Accuracy: 0.9740
- Macro F1: 0.8899
- Macro Precision: 0.8802
- Macro Recall: 0.9002
- Eval loss: 0.0834

(Computed with sklearn.metrics, excluding -100 positions.)
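The macro scores above average per-label F1 over the labels after dropping -100 positions. A small pure-Python sketch of that computation (a call to `sklearn.metrics.f1_score` with `average="macro"` on the same filtered pairs would give the same numbers):

```python
# Macro F1 over token labels, excluding ignored (-100) positions,
# mirroring how the card's metrics are computed.
def macro_f1(y_true, y_pred, ignore_index=-100):
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != ignore_index]
    labels = sorted({t for t, _ in pairs})
    f1s = []
    for lab in labels:
        tp = sum(1 for t, p in pairs if t == lab and p == lab)
        fp = sum(1 for t, p in pairs if t != lab and p == lab)
        fn = sum(1 for t, p in pairs if t == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro averaging weights every label equally, so the rare PER class counts as much as the dominant O class, which is why the macro scores sit below the accuracy figures above.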


## Intended Use

- NER over Spanish emergency/incident text (ECU-911-like).
- Downstream knowledge representation (OWL/SWRL).
- Academic research and prototyping.

## Limitations

- Domain-specific; performance may drop on other domains.
- Only PER and LOC entities are recognized.
- May struggle with colloquialisms, misspellings, or code-switching.

## How to use

```python
from transformers import pipeline

# aggregation_strategy="simple" merges subword pieces into whole-entity spans.
ner = pipeline(
    "token-classification",
    model="dannyLeo16/ner_model_bert_base",
    tokenizer="dannyLeo16/ner_model_bert_base",
    aggregation_strategy="simple",
)

text = "Se reporta accidente en la Av. de las Américas con dos personas heridas."
ner(text)
```
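The pipeline returns a list of entity dicts. For downstream OWL/SWRL population, a hedged sketch of turning that output into (label, text) pairs; the sample `predictions` below is illustrative and shaped like the transformers token-classification pipeline output, not an actual model prediction:

```python
# Illustrative output shape for aggregation_strategy="simple";
# score and character offsets are made up, not real predictions.
predictions = [
    {"entity_group": "LOC", "score": 0.99, "word": "Av. de las Américas",
     "start": 27, "end": 46},
]

def to_pairs(predictions, min_score=0.5):
    """Keep confident entities as (label, surface text) pairs."""
    return [(p["entity_group"], p["word"])
            for p in predictions if p["score"] >= min_score]
```

The `min_score` threshold is a hypothetical knob for filtering low-confidence spans before asserting them as ontology individuals.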