---
pipeline_tag: text-classification
language:
- nl
base_model:
- intfloat/multilingual-e5-small
license: mit
---

# Model Card

This model is a fine-tuned version of [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small). It was fine-tuned on [FactRank](https://github.com/lejafar/FactRank/tree/master/factrank) data, extended with samples from the Dutch and Belgian parliaments labelled by GPT and Gemini. The primary goal of this model is to determine whether a given statement warrants fact-checking. It does **not** determine whether the statement is factually correct.

The model assigns one of three labels: FR, FNR, or NF.

- **FR**: Factual, Relevant (the statement is fact-checkable and requires verification)
- **FNR**: Factual, Not Relevant (the statement can be fact-checked, but its wider relevance is low)
- **NF**: Not Factual (the statement does not contain information that can be fact-checked)

**Examples**:
- **FR**: *Toch blijkt uit cijfers van Flanders Investment & Trade dat ons handel met het Verenigd Koninkrijk opnieuw op het niveau ligt van voor de brexit.* ("Yet figures from Flanders Investment & Trade show that our trade with the United Kingdom is back at its pre-Brexit level.")
- **FNR**: *Ayleen werd opgelicht via dating fraude door de Tinder Swindler: "Het zijn net vampiers."* ("Ayleen was scammed through dating fraud by the Tinder Swindler: 'They are just like vampires.'")
- **NF**: *Het heeft weinig zin om zomaar een aantal maatregelen te tonen.* ("There is little point in simply showing a number of measures.")

**Supported language**: Dutch

## Usage

```python
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
from huggingface_hub import login

hf_token = "insert_your_token_here"  # only needed if the model is gated or private
login(token=hf_token)

config = AutoConfig.from_pretrained("textgain/FactRank_e5_small")
tokenizer = AutoTokenizer.from_pretrained("textgain/FactRank_e5_small")
model = AutoModelForSequenceClassification.from_pretrained("textgain/FactRank_e5_small", config=config)
model.eval()
pipe = pipeline(model=model, tokenizer=tokenizer, task="text-classification")

sample_texts = [
    "In een wereld die steeds digitaler wordt, moeten we het ook makkelijker maken om de controle over je financiën te hebben.",
    "Ik wil helemaal geen haren tussen u en de heer De Cock leggen.",
    "Je kunt van mening verschillen over welk gevolg je daaraan moet verbinden.",
    "We hebben 4.500 nieuwe kankergevallen in Nederland per jaar als gevolg van alcoholgebruik.",
    "Alcoholgebruik kost de samenleving 2 tot 4 miljard euro.",
    "Dus kan de minister daar vandaag wat meer over zeggen?",
]

results = pipe(sample_texts)
predicted_labels = [res["label"] for res in results]
```
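Each item in `results` is a dict with a `label` and a `score`, so filtering out the check-worthy statements is a one-liner. A minimal post-processing sketch (the `results` and `sample_texts` values below are illustrative placeholders, not actual model outputs):

```python
# Illustrative pipeline outputs: each item has a "label" and a "score".
results = [
    {"label": "NF", "score": 0.88},
    {"label": "FR", "score": 0.93},
    {"label": "FNR", "score": 0.71},
]
sample_texts = [
    "statement 1",
    "statement 2",
    "statement 3",
]

# Keep only statements that warrant fact-checking (label FR),
# optionally requiring a minimum confidence.
check_worthy = [
    (text, res["score"])
    for text, res in zip(sample_texts, results)
    if res["label"] == "FR" and res["score"] >= 0.5
]
print(check_worthy)
```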

## Interpretation of Results

**Factors influencing the label:**

- **Subjective evaluation**: Evaluative words such as "interesting", "surprising", or "incredible" may push the model towards predicting NF.
- **Research**: Mentions of research or studies push the model to treat the statement as a verifiable fact.
- **Context**: Statements made in certain contexts, e.g. about health and medicine, are more likely to receive an FR label.

## Training Details

The model was trained on a total of 13,786 data samples.

Parameters:

```python
num_epochs = 5
batch_size = 32
learning_rate = 1e-5
dropout = 0.5
gradient_accumulation_steps = 4
```
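With `gradient_accumulation_steps = 4`, gradients from four batches of 32 are accumulated before each optimizer update, giving an effective batch size of 128. A quick sanity check of the implied schedule (this sketch assumes all 13,786 samples were used for training, which the card does not state explicitly):

```python
import math

num_samples = 13_786
batch_size = 32
gradient_accumulation_steps = 4
num_epochs = 5

# Gradients from 4 batches are accumulated before each optimizer step.
effective_batch_size = batch_size * gradient_accumulation_steps

# Forward/backward passes per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_samples / batch_size)
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_optimizer_steps = optimizer_steps_per_epoch * num_epochs

print(effective_batch_size, batches_per_epoch, total_optimizer_steps)
```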