---
language:
  - es
license: apache-2.0
library_name: transformers
base_model: dccuchile/bert-base-spanish-wwm-cased
pipeline_tag: token-classification
tags:
  - ner
  - token-classification
  - spanish
  - bert
  - emergencies
  - ecu-911
datasets:
  - custom-ecu911
metrics:
  - accuracy
  - f1
  - precision
  - recall
model-index:
  - name: ner_model_bert_base
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          name: custom-ecu911
          type: custom
          split: test
        metrics:
          - type: accuracy
            value: 0.9739766081871345
          - type: f1
            name: Macro F1
            value: 0.8898766824816503
          - type: precision
            name: Macro Precision
            value: 0.8801934151701145
          - type: recall
            name: Macro Recall
            value: 0.9001920589792443
---

# NER for Spanish Emergency Reports (ECU-911)

Author/Maintainer: Danny Paltin (@dannyLeo16)
Task: Token Classification (NER)
Language: Spanish (es)
Finetuned from: dccuchile/bert-base-spanish-wwm-cased
Entities (BIO): PER and LOC (label set: ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"])

This model is a Spanish BERT fine-tuned to identify persons and locations in short emergency incident descriptions (ECU-911-style). It was developed for the research project:

“Representación del conocimiento para emergencias del ECU-911 mediante PLN, ontologías OWL y reglas SWRL” (Knowledge representation for ECU-911 emergencies using NLP, OWL ontologies, and SWRL rules).


## Model Details

- Architecture: BERT (Whole Word Masking, cased)
- Tokenizer: dccuchile/bert-base-spanish-wwm-cased
- Max length: uses the base tokenizer's model_max_length (padding to max length)
- Libraries: 🤗 Transformers, 🤗 Datasets, PyTorch
- Labels: O, B-PER, I-PER, B-LOC, I-LOC
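The five labels above correspond to integer class ids. A minimal sketch of the id-to-label mappings, assuming the ids follow the order of the list above (the authoritative mapping is the one in the model's config):

```python
# Label set as listed above; the id ordering is an assumption and should
# be verified against the model's config.json (id2label / label2id).
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

id2label = {i: lab for i, lab in enumerate(labels)}
label2id = {lab: i for i, lab in enumerate(labels)}
```

Under this assumed ordering, `id2label[1]` is `"B-PER"` and `label2id["I-LOC"]` is `4`.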

## Training Data

- Source: custom Spanish emergency reports (Ecuador, ECU-911-style) with token-level BIO annotations.
- Size: 510 texts; 34,232 tokens (avg 67.12 tokens/text).
- Entity counts (BIO spans): PER = 421, LOC = 1,643.
- Token-level label distribution: O = 30,132, B-LOC = 1,643, I-LOC = 1,617, B-PER = 421, I-PER = 419.
- Splits: 80% train / 10% validation / 10% test (random split performed at training time).
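In the BIO scheme used above, the first token of an entity span is tagged `B-` and continuation tokens `I-`. A hypothetical annotated report (invented for illustration, not drawn from the dataset) and a small helper that recovers entity spans from the tags:

```python
# Hypothetical BIO-annotated example (not an actual dataset record).
tokens = ["Juan", "Pérez", "reporta", "un", "incendio", "en", "la",
          "Av.", "de", "las", "Américas"]
tags   = ["B-PER", "I-PER", "O", "O", "O", "O", "O",
          "B-LOC", "I-LOC", "I-LOC", "I-LOC"]

def bio_spans(tokens, tags):
    """Collect (label, surface text) entity spans from BIO tags."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new span begins
            if current:
                spans.append(current)
            current = [tag[2:], [tok]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)        # continuation of the open span
        else:                             # "O" (or a stray I-) closes the span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]
```

Applied to the example above, `bio_spans` yields one PER span ("Juan Pérez") and one LOC span ("Av. de las Américas").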

Privacy/Ethics. Data should be anonymized and free of PII. Do not deploy on personal/live data without consent and compliance with local regulation.


## Training Procedure

- Objective: token classification (cross-entropy); continuation subwords and special tokens are masked with label -100 so they do not contribute to the loss.
- Hyperparameters:
  - learning_rate = 2e-5
  - num_train_epochs = 3
  - per_device_train_batch_size = 8
  - per_device_eval_batch_size = 8
  - weight_decay = 0.01
  - evaluation_strategy = "epoch", save_strategy = "epoch"
  - load_best_model_at_end = true (best checkpoint selected by eval_loss)
- Data collator: DataCollatorForTokenClassification (padding to max_length)
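The -100 masking mentioned above aligns word-level BIO labels to subword positions so that only each word's first subword carries a real label. A minimal sketch of that alignment, operating on a precomputed `word_ids` list of the kind a fast tokenizer returns (`None` for special tokens):

```python
# Align word-level labels to subword positions; continuation subwords
# and special tokens get -100, which cross-entropy ignores.
def align_labels(word_ids, word_labels, ignore_index=-100):
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:            # special token ([CLS], [SEP], padding)
            aligned.append(ignore_index)
        elif wid != previous:      # first subword of a word: keep its label
            aligned.append(word_labels[wid])
        else:                      # continuation subword: mask out
            aligned.append(ignore_index)
        previous = wid
    return aligned
```

For instance, with `word_ids = [None, 0, 0, 1, None]` (one word split into two subwords, then a one-subword word) and word labels `[3, 0]`, the aligned sequence is `[-100, 3, -100, 0, -100]`.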

## Evaluation

Validation (epoch 3):

- Accuracy: 0.9480
- Macro F1: 0.7998
- Macro Precision: 0.7914
- Macro Recall: 0.8118
- Eval loss: 0.1458

Test:

- Accuracy: 0.9740
- Macro F1: 0.8899
- Macro Precision: 0.8802
- Macro Recall: 0.9002
- Eval loss: 0.0834

(Computed with sklearn.metrics, excluding -100 positions.)
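The macro scores above average per-label F1 over the labels after dropping -100 positions. A small pure-Python sketch of that computation (a call to `sklearn.metrics.f1_score` with `average="macro"` on the same filtered pairs would give the same numbers):

```python
# Macro F1 over token labels, excluding ignored (-100) positions,
# mirroring how the card's metrics are computed.
def macro_f1(y_true, y_pred, ignore_index=-100):
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != ignore_index]
    labels = sorted({t for t, _ in pairs})
    f1s = []
    for lab in labels:
        tp = sum(1 for t, p in pairs if t == lab and p == lab)
        fp = sum(1 for t, p in pairs if t != lab and p == lab)
        fn = sum(1 for t, p in pairs if t == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro averaging weights every label equally, so the rare PER class counts as much as the dominant O class, which is why the macro scores sit below the accuracy figures above.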


## Intended Use

- NER over Spanish emergency/incident text (ECU-911-like).
- Downstream knowledge representation (OWL/SWRL).
- Academic research and prototyping.

## Limitations

- Domain-specific; performance may drop on other domains.
- Only PER and LOC entities are recognized.
- May struggle with colloquialisms, misspellings, or code-switching.

## How to use

```python
from transformers import pipeline

# aggregation_strategy="simple" merges subword pieces into whole-entity spans.
ner = pipeline(
    "token-classification",
    model="dannyLeo16/ner_model_bert_base",
    tokenizer="dannyLeo16/ner_model_bert_base",
    aggregation_strategy="simple",
)

text = "Se reporta accidente en la Av. de las Américas con dos personas heridas."
ner(text)
```
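The pipeline returns a list of entity dicts. For downstream OWL/SWRL population, a hedged sketch of turning that output into (label, text) pairs; the sample `predictions` below is illustrative and shaped like the transformers token-classification pipeline output, not an actual model prediction:

```python
# Illustrative output shape for aggregation_strategy="simple";
# score and character offsets are made up, not real predictions.
predictions = [
    {"entity_group": "LOC", "score": 0.99, "word": "Av. de las Américas",
     "start": 27, "end": 46},
]

def to_pairs(predictions, min_score=0.5):
    """Keep confident entities as (label, surface text) pairs."""
    return [(p["entity_group"], p["word"])
            for p in predictions if p["score"] >= min_score]
```

The `min_score` threshold is a hypothetical knob for filtering low-confidence spans before asserting them as ontology individuals.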