---
language:
- es
license: apache-2.0
library_name: transformers
base_model: dccuchile/bert-base-spanish-wwm-cased
pipeline_tag: token-classification
tags:
- ner
- token-classification
- spanish
- bert
- emergencies
- ecu-911
datasets:
- custom-ecu911
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: ner_model_bert_base
results:
- task:
type: token-classification
name: Named Entity Recognition
dataset:
name: custom-ecu911
type: custom
split: test
metrics:
- type: accuracy
value: 0.9739766081871345
- type: f1
name: Macro F1
value: 0.8898766824816503
- type: precision
name: Macro Precision
value: 0.8801934151701145
- type: recall
name: Macro Recall
value: 0.9001920589792443
---
# NER for Spanish Emergency Reports (ECU-911)
**Author/Maintainer:** Danny Paltin ([@dannyLeo16](https://huggingface.co/dannyLeo16))
**Task:** Token Classification (NER)
**Language:** Spanish (es)
**Finetuned from:** `dccuchile/bert-base-spanish-wwm-cased`
**Entities (BIO):** `PER` and `LOC` — label set `["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]`
This model is a Spanish BERT fine-tuned to identify **persons** and **locations** in short emergency incident descriptions (ECU-911-style). It was developed for the research project:
> **“Representación del conocimiento para emergencias del ECU-911 mediante PLN, ontologías OWL y reglas SWRL.”**
---
## Model Details
- **Architecture:** BERT (Whole Word Masking, cased)
- **Tokenizer:** `dccuchile/bert-base-spanish-wwm-cased`
- **Max length:** uses base tokenizer `model_max_length` (padding to max length)
- **Libraries:** 🤗 Transformers, 🤗 Datasets, PyTorch
- **Labels:** `O, B-PER, I-PER, B-LOC, I-LOC`
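The label list above maps to integer ids in the usual Transformers `id2label`/`label2id` convention. A minimal sketch (the exact index order is an assumption based on the list as written):

```python
# Label set as listed above; the index order is an assumption
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

print(id2label)  # {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-LOC', 4: 'I-LOC'}
```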
---
## Training Data
- **Source:** Custom Spanish emergency reports (Ecuador, ECU-911-style) with token-level BIO annotations.
- **Size:** **510** texts; **34,232** tokens (avg **67.12** tokens/text).
- **Entity counts (BIO spans):** **PER = 421**, **LOC = 1,643**.
- **Token-level label distribution:** `O=30,132`, `B-LOC=1,643`, `I-LOC=1,617`, `B-PER=421`, `I-PER=419`.
- **Splits:** 80% train / 10% validation / 10% test (random split performed at training time).
> **Privacy/Ethics.** Data should be anonymized and free of PII. Do not deploy on personal/live data without consent and compliance with local regulation.
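The 80/10/10 split can be reproduced with a simple shuffled index split. This is a sketch, not the exact training script; the seed and helper name are assumptions:

```python
import random

def split_80_10_10(examples, seed=42):
    # Random 80/10/10 split as described above; the seed value is an assumption
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    n_train = int(0.8 * len(examples))
    n_val = int(0.1 * len(examples))
    train = [examples[i] for i in idx[:n_train]]
    val = [examples[i] for i in idx[n_train:n_train + n_val]]
    test = [examples[i] for i in idx[n_train + n_val:]]
    return train, val, test

# With the 510 texts reported above: 408 train / 51 validation / 51 test
train, val, test = split_80_10_10(list(range(510)))
print(len(train), len(val), len(test))  # 408 51 51
```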
---
## Training Procedure
- **Objective:** Token classification (cross-entropy); continuation subwords ignored with `-100`.
- **Hyperparameters:**
- `learning_rate = 2e-5`
- `num_train_epochs = 3`
- `per_device_train_batch_size = 8`
- `per_device_eval_batch_size = 8`
- `weight_decay = 0.01`
- `evaluation_strategy = "epoch"`, `save_strategy = "epoch"`
  - `load_best_model_at_end = true` *(selected by `eval_loss`)*
- **Data collator:** `DataCollatorForTokenClassification` (padding to `max_length`)
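The `-100` masking of continuation subwords described above can be sketched as follows. The helper name and the example `word_ids` are illustrative; in practice `word_ids` comes from a fast tokenizer's `word_ids()` method:

```python
def align_labels_with_tokens(labels, word_ids):
    # Assign -100 to special tokens and continuation subwords so that
    # cross-entropy ignores them; only the first subword keeps the label.
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)             # special tokens ([CLS], [SEP], padding)
        elif word_id != previous_word:
            aligned.append(labels[word_id])  # first subword keeps the word-level label
        else:
            aligned.append(-100)             # continuation subword, ignored in the loss
        previous_word = word_id
    return aligned

# Illustrative example: "Av. Américas" split into subwords by the tokenizer
word_ids = [None, 0, 1, 1, None]   # [CLS], "Av.", "Amér", "##icas", [SEP]
labels = [3, 4]                    # B-LOC, I-LOC
print(align_labels_with_tokens(labels, word_ids))  # [-100, 3, 4, -100, -100]
```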
---
## Evaluation
**Validation (epoch 3):**
- Accuracy: **0.9480**
- Macro F1: **0.7998**
- Macro Precision: **0.7914**
- Macro Recall: **0.8118**
- Eval loss: **0.1458**
**Test:**
- Accuracy: **0.9740**
- Macro F1: **0.8899**
- Macro Precision: **0.8802**
- Macro Recall: **0.9002**
- Eval loss: **0.0834**
*(Computed with `sklearn.metrics`, excluding `-100` positions.)*
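A sketch of how such macro metrics can be computed with `sklearn.metrics` while dropping the `-100` positions; the function name and the toy arrays are illustrative, not the exact evaluation script:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(predictions, label_ids):
    # Flatten batch x sequence arrays and drop positions labeled -100
    # (special tokens and continuation subwords).
    preds = np.asarray(predictions).flatten()
    refs = np.asarray(label_ids).flatten()
    mask = refs != -100
    preds, refs = preds[mask], refs[mask]
    p, r, f1, _ = precision_recall_fscore_support(
        refs, preds, average="macro", zero_division=0
    )
    return {"accuracy": accuracy_score(refs, preds),
            "precision": p, "recall": r, "f1": f1}

# Toy sanity check with synthetic label ids: the -100 position is ignored
refs = np.array([[0, 1, -100, 3]])
preds = np.array([[0, 1, 2, 3]])
print(compute_metrics(preds, refs))
```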
---
## Intended Use
- NER over Spanish emergency/incident text (ECU-911-like).
- Downstream knowledge representation (OWL/SWRL).
- Academic research and prototyping.
### Limitations
- Domain-specific; performance may drop on other domains.
- Only `PER` and `LOC` entities.
- May struggle with colloquialisms, misspellings, or code-switching.
---
## How to use
```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dannyLeo16/ner_model_bert_base",
    tokenizer="dannyLeo16/ner_model_bert_base",
    aggregation_strategy="simple",
)

text = "Se reporta accidente en la Av. de las Américas con dos personas heridas."
print(ner(text))
```