# study_news_detection_english
This model is a fine-tuned version of microsoft/deberta-v3-base that was trained to identify texts that contain a reference to a research result. It was trained on a dataset of 4200 annotated print news articles from the US and UK that was created specifically for this task.
## Model description
The model was trained to detect texts that mention at least one individual research result (a scientific study, e.g. a journal paper). Given (journalistic) texts, it produces a binary classification: texts are classified as either 1 (mentions a research result) or 0 (does not mention a research result). The model was developed as part of a research project at Karlsruhe Institute of Technology, which investigated journalistic coverage of individual research results. In the same project, a similar model was trained for the same task on German news articles (study_news_detection_german) as well as two models that were fine-tuned to identify journal names in German and English news articles (journal_identification_german and journal_identification_english).
- Model type: text classification
- Language: English
- Finetuned from: microsoft/deberta-v3-base
- Supported by: The author acknowledges support by the state of Baden-Württemberg through bwHPC.
## Intended uses & limitations
The intended use of this model is to enable large-scale analyses of public communication about scientific research. While it has some utility on its own, it is primarily intended to be part of a larger analysis pipeline in which it serves as the first filtering step. For example, it was used to identify texts that contain a reference to a research result in a large corpus of more than one million news articles from the US and UK. These texts were the basis for several subsequent analyses that examined patterns in the journalistic coverage of research results (e.g. with regard to source or event selection).
### How to use
You can use this model with a Transformers pipeline for text classification.
Important: the tokenizer for DeBERTaV3 uses the SentencePiece package, which has to be installed in addition to the transformers package.
```python
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained('nikoprom/study_news_detection_english', use_fast=False)
>>> model = AutoModelForSequenceClassification.from_pretrained('nikoprom/study_news_detection_english')
>>> classifier = pipeline('text-classification', tokenizer=tokenizer, model=model)
>>> text = ['''Scientists in France have revealed that two kind of drivers cause traffic jams.
... Aggressive ones. And timid ones.
... Leaving the eventempered ones sitting there wondering why people need a degree to work stuff like this out.''']
>>> classifier(text)
[{'label': 'LABEL_1', 'score': 0.9763215184211731}]
```
Texts passed to the model should be complete news articles or at least full paragraphs, as this was the setting in which the model was fine-tuned.
### Limitations
The model was developed for a very narrow use case in a research project and fine-tuned on a rather small dataset with texts from a very specific context (see below). As a consequence, its performance could be much worse when applied to texts from other domains (e.g. types of texts other than news articles, texts from other periods of time).
In addition, a very specific definition of research results was used in the creation of the training data. This definition excludes "studies" that were conducted by institutions that are not part of the science system and/or primarily out of political or economic interest (e.g. political polls, consumer surveys by market research companies).
## Training data
The training data was created as part of a larger manual content analysis in which the coverage of research results in print media from three countries (Germany, UK, US) was investigated. The dataset used for this model contained 4200 articles retrieved from a press database with a broad search string that included several research- or science-related terms. These articles were published in 72 different media outlets from the UK and US over three years (2010, 2019-2020). Each article was classified by four human coders as containing no (label 0) or at least one (label 1) reference to a research result. Intercoder reliability was satisfactory (average pairwise agreement: 94.5 %, Krippendorff's alpha: 0.76). The label distribution is rather imbalanced, with only 11.8 % of all articles classified as containing at least one reference to a research result.
## Training procedure
All texts were cleaned to remove some frequent formatting errors present in the original articles (e.g. â€™ instead of '). 64 % of the texts (2688) were used for training, 16 % (672) for validation and 20 % (840) for testing. The texts were tokenized using the SentencePiece tokenizer corresponding to the model (vocabulary size of 128k, no lowercasing, with padding and truncation). The model was then fine-tuned using TensorFlow on a single NVIDIA Tesla V100-SXM2-32GB GPU on the bwUniCluster 2.0. The learning rate was chosen by comparing four values (5e-6, 1e-5, 2e-5, 3e-5) and selecting the one that maximized accuracy on the validation set. For the final model, ten trials with identical training parameters were conducted and the model with the highest F1 score on the validation set was selected.
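The exact cleaning rules are not published; the following is a minimal sketch of the kind of mojibake substitution described above, with an assumed (hypothetical) replacement table:

```python
# Minimal sketch of the kind of mojibake repair described above.
# The actual cleaning rules used in the project are not published;
# this substitution table is an assumption for illustration only.
FIXES = {
    "â€™": "'",   # right single quotation mark, UTF-8 bytes mis-decoded as Latin-1
    "â€œ": '"',   # left double quotation mark (assumed additional rule)
}

def clean(text: str) -> str:
    for bad, good in FIXES.items():
        text = text.replace(bad, good)
    return text

print(clean("donâ€™t"))  # → don't
```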
### Training hyperparameters
The following hyperparameters were used during training:
- Batch size: 8
- Number of epochs: 5
- Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- Learning rate: 2e-5
- Warmup steps: 294
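For orientation, the 294 warmup steps can be related to the total number of optimizer steps implied by the split and batch size above (assuming one pass over all 2688 training texts per epoch and no gradient accumulation, which the card does not state explicitly):

```python
# Relate the 294 warmup steps to the total number of optimizer steps,
# assuming one pass over all 2688 training texts per epoch, batch size 8,
# 5 epochs, and no gradient accumulation (assumptions not stated in the card).
steps_per_epoch = 2688 // 8        # 336 optimizer steps per epoch
total_steps = steps_per_epoch * 5  # 1680 steps over 5 epochs
warmup_fraction = 294 / total_steps
print(warmup_fraction)  # → 0.175
```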
### Framework versions
- Transformers 4.32.0
- TensorFlow 2.14.0
- Datasets 2.12.0
- Tokenizers 0.13.3
## Evaluation
The model was evaluated on a test set of 840 articles. To obtain a binary classification from the model output (a score between 0 and 1 for each text), a decision threshold has to be chosen. A threshold of 0.5 gives the following results:
- Accuracy: 0.9440
- Precision: 0.8354
- Recall: 0.66
- F1: 0.7374
Confusion matrix:
|   | Predicted label 0 | Predicted label 1 |
|---|---|---|
| True label 0 | 727 | 13 |
| True label 1 | 34 | 66 |
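The reported metrics follow directly from this confusion matrix:

```python
# Reproduce the reported metrics from the confusion matrix above.
tn, fp = 727, 13   # true label 0
fn, tp = 34, 66    # true label 1

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# → 0.944 0.8354 0.66 0.7374
```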
Other threshold values give slightly different results. If we vary the threshold in steps of 0.05 between 0.1 and 0.9, the maximum F1 score is achieved with a value of 0.35:
- Precision: 0.8085
- Recall: 0.76
- F1: 0.7835
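The text-classification pipeline reports only the winning label, so applying a non-default threshold requires working with the class-1 probability directly. A minimal sketch with hypothetical scores (the function name and score values are illustrative, not part of the model's API):

```python
# Apply a custom decision threshold to class-1 probabilities.
# The scores below are hypothetical stand-ins for the probabilities
# the model would produce for a batch of texts.
def apply_threshold(scores_label1, threshold=0.35):
    """Map class-1 probabilities to binary labels at a chosen threshold."""
    return [1 if s >= threshold else 0 for s in scores_label1]

scores = [0.976, 0.42, 0.31, 0.07]             # hypothetical label-1 probabilities
print(apply_threshold(scores))                  # → [1, 1, 0, 0]
print(apply_threshold(scores, threshold=0.5))   # → [1, 0, 0, 0]
```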