journal_identification_english

This model is a fine-tuned version of bert-base-cased trained to identify and extract references to scientific journals in English-language news coverage. It was trained on a dataset of 9,378 annotated paragraphs from US and UK print news articles that was created specifically for this task.

Model description

Similar to a named entity recognition (NER) model, this model has been trained to detect a specific type of entity in text: scientific journals. Each token in a text is classified as either irrelevant (no journal name), the first part of a journal name, or a later part of a journal name. The model was developed as part of a research project at Karlsruhe Institute of Technology that investigated journalistic coverage of individual research results. The same project produced a similar model for identifying journal names in German news articles (journal_identification_german), as well as two models fine-tuned to detect German and English news articles that contain a reference to a research result (study_news_detection_german and study_news_detection_english).

  • Model type: token classification
  • Language: English
  • Finetuned from: bert-base-cased
  • Supported by: The author acknowledges support by the state of Baden-Württemberg through bwHPC.

Intended uses & limitations

The intended use of this model is to enable large-scale analyses of how journalists select scientific journals as sources for their coverage. It was used to extract journal names from more than 80k news articles from the UK and more than 32k news articles from the US to study the dominance of individual sources in science news coverage.

How to use

You can use this model with a Transformers pipeline for token classification:

>>> from transformers import pipeline
>>> journal_identifier = pipeline('token-classification', model='nikoprom/journal_identification_english')
>>> sentences = ['The study, in BMJ, controlled for age, race, education and many diet, health and behavioral characteristics.']
>>> journal_identifier(sentences)

[{'entity': 'J-Start',
  'score': np.float32(0.96135074),
  'index': 5,
  'word': 'B',
  'start': 14,
  'end': 15},
 {'entity': 'J-Start',
  'score': np.float32(0.9271554),
  'index': 6,
  'word': '##M',
  'start': 15,
  'end': 16},
 {'entity': 'J-Start',
  'score': np.float32(0.8430263),
  'index': 7,
  'word': '##J',
  'start': 16,
  'end': 17}]

Text passed to the model should consist of whole paragraphs, or at least complete sentences, as this is the setting in which the model was fine-tuned.
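Accordingly, longer articles can be split into paragraph-level chunks before inference. A minimal sketch of such a preprocessing step, splitting on line breaks as in the training setup (the helper name `split_into_paragraphs` is hypothetical, not part of the released code):

```python
def split_into_paragraphs(text: str) -> list[str]:
    """Split an article into paragraphs on line breaks, dropping empty lines."""
    return [p.strip() for p in text.splitlines() if p.strip()]

article = "First paragraph about a study in Nature.\n\nSecond paragraph with more detail."
paragraphs = split_into_paragraphs(article)
# Each paragraph can then be passed to the pipeline individually.
```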

Limitations

The model was developed for a very narrow use case in a research project and fine-tuned on a rather small dataset with texts from a very specific context (see below). As a consequence, its performance may be considerably worse when applied to texts from other domains (e.g. text types other than news articles, or texts from other periods of time).

In addition, model output should be checked and post-processed before further use, for at least two reasons: sometimes only some of the subwords of a journal name are tagged as part of a name, and occasionally tokens inside a journal name are not identified as part of the name, so that a single journal is detected as two separate names.
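One way to post-process the raw predictions is to merge contiguous tagged subwords into full journal names using the character offsets the pipeline returns. A minimal sketch, assuming pipeline-style output dicts with 'entity', 'start' and 'end' keys (the function name `merge_journal_tokens` is hypothetical):

```python
def merge_journal_tokens(text, predictions):
    """Merge contiguous J-Start/J-Inner predictions into journal-name spans.

    `predictions` is a list of dicts as returned by the token-classification
    pipeline, each with 'entity', 'start' and 'end' keys.
    """
    names = []
    current = None  # (start, end) character span being built
    for pred in predictions:
        if pred['entity'] not in ('J-Start', 'J-Inner'):
            continue
        if current is not None and pred['start'] <= current[1] + 1:
            # Token is adjacent to (or overlaps) the current span: extend it.
            current = (current[0], pred['end'])
        else:
            if current is not None:
                names.append(text[current[0]:current[1]])
            current = (pred['start'], pred['end'])
    if current is not None:
        names.append(text[current[0]:current[1]])
    return names

sentence = 'The study, in BMJ, controlled for age, race, education and many diet, health and behavioral characteristics.'
preds = [{'entity': 'J-Start', 'start': 14, 'end': 15},
         {'entity': 'J-Start', 'start': 15, 'end': 16},
         {'entity': 'J-Start', 'start': 16, 'end': 17}]
merge_journal_tokens(sentence, preds)  # → ['BMJ']
```

Note that this simple distance-based merge would also join two distinct names separated by a single character, so manual checks remain advisable.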

Training data

The training data was created as part of a larger manual content analysis in which the coverage of research results in print media from three countries (Germany, UK, US) was investigated. The dataset used for this model contained 495 articles mentioning a specific research result. These articles were published in 72 different media outlets from the UK and US over three years (2010, 2019-2020). All names of scientific journals (e.g. Nature, Cell Metabolism, PNAS) or preprint servers (e.g. medRxiv, SSRN) in the texts were marked by four human coders. Based on these annotations, each token was classified into one of three classes:

Label    Class
O        No journal name
J-Start  First word of a journal name
J-Inner  Second (or later) word of a journal name
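As an illustration of this labelling scheme, the following sketch assigns word-level labels given annotated journal spans (the helper `label_words` is a hypothetical illustration, not part of the annotation tooling):

```python
def label_words(words, journal_spans):
    """Assign O / J-Start / J-Inner labels to a list of words.

    `journal_spans` is a list of (start, end) word-index ranges
    (end exclusive) marking annotated journal names.
    """
    labels = ['O'] * len(words)
    for start, end in journal_spans:
        labels[start] = 'J-Start'          # first word of the name
        for i in range(start + 1, end):
            labels[i] = 'J-Inner'          # later words of the name
    return labels

words = ['The', 'study', 'appeared', 'in', 'Cell', 'Metabolism', '.']
label_words(words, [(4, 6)])
# → ['O', 'O', 'O', 'O', 'J-Start', 'J-Inner', 'O']
```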

Training procedure

All texts were cleaned to remove some frequent formatting errors present in the original articles (e.g. â(EURO)(TM) instead of '). Each text was split into paragraphs based on line breaks; paragraphs containing more than 300 words were additionally split into sentences, to ensure that their number of tokens would not exceed the maximum input length accepted by the model. 64% of the paragraphs (6,002) were used for training, 16% (1,500) for validation and 20% (1,876) for testing.

Further preprocessing and fine-tuning largely followed the steps outlined in the notebook "Fine-tuning a model on a token classification task" provided by Hugging Face. The paragraphs were tokenized using the WordPiece tokenizer corresponding to the model (vocabulary size 28,996, without lower-casing). Because words that are not in the vocabulary are split into subwords by this tokenizer, the journal labels had to be aligned with the resulting tokens.

The model was then fine-tuned using TensorFlow on a single NVIDIA Tesla V100-SXM2-32GB on the bwUniCluster 2.0. For the final model, ten trials with identical training parameters were conducted and the model with the highest F1 score on the validation set was selected.
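The alignment of word-level labels with WordPiece subwords can be sketched as follows, assuming the convention used in the referenced Hugging Face notebook: only the first subword of each word keeps its label, while special tokens and continuation subwords are masked with -100 so the loss ignores them (the model card does not state the exact alignment scheme, so this is an assumption):

```python
def align_labels_with_subwords(word_ids, word_labels):
    """Map word-level label ids onto subword tokens.

    `word_ids` mimics the tokenizer's word_ids() output: one entry per
    subword, None for special tokens ([CLS], [SEP]).
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)              # special token
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword of a word
        else:
            aligned.append(-100)              # continuation subword
        previous = wid
    return aligned

# Example: 'BMJ' is split into three subwords (B, ##M, ##J).
word_ids = [None, 0, 1, 2, 2, 2, None]   # [CLS] The journal B ##M ##J [SEP]
word_labels = [0, 0, 1]                  # 0 = O, 1 = J-Start
align_labels_with_subwords(word_ids, word_labels)
# → [-100, 0, 0, 1, -100, -100, -100]
```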

Training hyperparameters

The following hyperparameters were used during training:

  • Batch size: 16
  • Number of epochs: 15
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • Learning rate: 2e-5
  • Weight decay rate: 0.01

Framework versions

  • Transformers 4.32.0
  • TensorFlow 2.14.0
  • Datasets 2.12.0
  • Tokenizers 0.13.3

Evaluation

The model was evaluated with a test set of 1876 paragraphs using precision, recall and F1 score (calculated using seqeval):

Class    Precision  Recall  F1
J-Start  0.931      0.931   0.931
J-Inner  0.783      0.783   0.783
Overall  0.865      0.865   0.865
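seqeval computes these scores at the entity level: a predicted journal name counts as correct only if its full span matches the gold annotation. A minimal pure-Python illustration of span-level precision, recall and F1 in the spirit of that scheme (not the actual evaluation code, and simplified to this model's J-Start/J-Inner labels):

```python
def extract_spans(labels):
    """Extract (start, end) journal spans from a J-Start/J-Inner sequence."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == 'J-Start':
            if start is not None:
                spans.append((start, i))   # a new name begins: close the old one
            start = i
        elif lab == 'O' and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

def span_f1(gold, pred):
    """Entity-level precision, recall and F1 over exact span matches."""
    gold_spans, pred_spans = set(extract_spans(gold)), set(extract_spans(pred))
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ['O', 'J-Start', 'J-Inner', 'O', 'J-Start', 'O']
pred = ['O', 'J-Start', 'J-Inner', 'O', 'O', 'O']
span_f1(gold, pred)  # precision=1.0, recall=0.5, f1≈0.667
```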