---
language:
- nl
license: mit
tags:
- digital-humanities
- token-classification
base_model:
- FacebookAI/xlm-roberta-base
---

# Model Card for NER-base

[Globalise](https://globalise.huygens.knaw.nl/) NER token-classification model, development version.

## Model Details

### Model Description

This is the first version of a NER model developed for the Globalise project.

- **Developed by:** Sophie Arnoult
- **Shared by:** Globalise Team
- **Funded by:** NWO
- **Model type:** token classification

## Uses

Named-entity tagging of historical (17th-18th century), VOC-related Dutch documents.
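
For reference, a minimal inference sketch using the Hugging Face `transformers` token-classification pipeline. The repository id below is a placeholder (this card does not state the published model id), and the example sentence is invented:

```python
# Minimal inference sketch. The model id is a placeholder; substitute the
# actual repository id of this model.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="globalise/ner-base",     # placeholder id, not confirmed by this card
    aggregation_strategy="simple",  # merge subword pieces into entity spans
)

# "The ship Amsterdam departed from Texel to Batavia in 1688."
text = "Het schip Amsterdam vertrok in 1688 van Texel naar Batavia."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```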

## Bias, Risks, and Limitations

The texts the model was fine-tuned on are heavily biased, representing colonial standpoints. While care has been taken in designing the label set and annotating the data, biases may remain when the model is applied to similar data; the model has not been tested on other data.

This is a development version. The training and development data consist of [VOC missives](https://research.vu.nl/en/datasets/voc-gm-ner-corpus) data enriched with new annotations. Most entity types used in Globalise are not present in the VOC missives data, and the new annotations are limited in number; performance on these entity types may therefore not be representative.

## Training Details

### Training Data

The training and development data consist of:

- the GM NER corpus ([datasplit-all-standard](https://data.yoda.vu.nl:9443/vault-fgw-llc-vocmissives/voc_gm_ner%5B1670857835%5D/original/datasplit_all_standard/), train/dev data), with labels mapped to their Globalise equivalents
- Globalise annotated data (a first set of annotations, to be extended and published at a later date)

The data are pretokenized with [spaCy](https://spacy.io/models/nl#nl_core_news_lg). Sequences are split at 240 word tokens.
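
As a sketch, this preprocessing could look as follows; the simple fixed-window chunking is an assumption, as the card does not specify how sequences are split:

```python
# Pretokenization sketch: tokenize with spaCy's Dutch pipeline and split the
# token stream into sequences of at most 240 word tokens. The fixed-window
# chunking is an assumption; the project's actual splitting may differ.
import spacy

nlp = spacy.load("nl_core_news_lg")  # requires: python -m spacy download nl_core_news_lg

def pretokenize(text: str, max_tokens: int = 240) -> list[list[str]]:
    tokens = [token.text for token in nlp(text)]
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
```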

### Training Procedure

#### Training Hyperparameters

- **Training regime:** fp32
- **Optimizer:** Adam, learning rate 3e-5
- **Max sequence length:** 512
- **Batch size:** 32
- **Max epochs:** 20
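
As an illustration, these settings map onto a `transformers` fine-tuning setup roughly as below. This is a sketch, not the project's actual training script; in particular, the number of labels is an assumption (the 15 entity types from the results table in a BIO scheme, plus "O"):

```python
# Fine-tuning configuration sketch mirroring the hyperparameters above.
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/xlm-roberta-base",
    num_labels=31,  # assumption: 15 entity types in a BIO scheme, plus "O"
)

# Pretokenized sequences (see Training Data) are re-encoded with the model's
# subword tokenizer and truncated at the maximum sequence length of 512.
encoding = tokenizer(
    ["Het", "schip", "Amsterdam"],
    is_split_into_words=True,
    truncation=True,
    max_length=512,
)

args = TrainingArguments(
    output_dir="ner-base",
    learning_rate=3e-5,              # Adam, learning rate 3e-5
    per_device_train_batch_size=32,  # batch size 32
    num_train_epochs=20,             # max epochs 20
    # fp32 is the default: no mixed-precision flags are set.
)
```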

## Evaluation

The model checkpoint was selected on the validation set using weighted multiclass F1, with a single seed.
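
The per-label scores below follow the format of entity-level evaluation tooling. A sketch of how such scores can be computed with `seqeval` (an assumption; the card does not name its evaluation library):

```python
# Entity-level evaluation sketch using seqeval (an assumption; the card does
# not name its evaluation library). Inputs are BIO-tagged token sequences.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-SHIP", "O", "B-LOC_NAME", "I-LOC_NAME"]]
y_pred = [["B-SHIP", "O", "B-LOC_NAME", "O"]]

print(classification_report(y_true, y_pred))
# Weighted multiclass F1, as used here for model selection:
print(f1_score(y_true, y_pred, average="weighted"))
```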

### Results

| label | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| CMTY_NAME | 0.72 | 0.80 | 0.76 | 109 |
| CMTY_QUAL | 1.00 | 0.67 | 0.80 | 9 |
| CMTY_QUANT | 0.76 | 0.85 | 0.80 | 66 |
| DATE | 0.48 | 0.53 | 0.51 | 43 |
| DOC | 0.61 | 0.55 | 0.58 | 20 |
| ETH_REL | 0.78 | 0.81 | 0.79 | 31 |
| LOC_ADJ | 0.91 | 0.96 | 0.94 | 464 |
| LOC_NAME | 0.91 | 0.94 | 0.92 | 1324 |
| ORG | 0.92 | 0.87 | 0.89 | 265 |
| PER_ATTR | 0.69 | 0.82 | 0.75 | 44 |
| PER_NAME | 0.80 | 0.87 | 0.83 | 613 |
| PRF | 0.70 | 0.76 | 0.73 | 97 |
| SHIP | 0.89 | 0.86 | 0.87 | 519 |
| SHIP_TYPE | 0.79 | 0.82 | 0.81 | 33 |
| STATUS | 0.96 | 0.96 | 0.96 | 27 |
| micro avg | 0.86 | 0.89 | 0.88 | 3664 |
| macro avg | 0.79 | 0.80 | 0.80 | 3664 |
| weighted avg | 0.86 | 0.89 | 0.88 | 3664 |

## Technical Specifications

### Compute Infrastructure

SURF Snellius (the Dutch national supercomputer)

#### Hardware

NVIDIA A100