Durrani95 committed on
Commit ea0cb4c · verified · 1 Parent(s): 0d3b0b2

Add fine-tuned EuroBERT for binary geopolitical classification

Files changed (1)
  1. README_eurobert_geopol_binary.md +127 -0
README_eurobert_geopol_binary.md ADDED
@@ -0,0 +1,127 @@
---
library_name: transformers
pipeline_tag: text-classification
base_model: EuroBERT/EuroBERT-210m
base_model_relation: finetune
tags:
- eurobert
- fine-tuned
- transformers
- pytorch
- sequence-classification
- binary-classification
- geopolitics
- multilingual
language:
- en
- de
- fr
- es
- it
---

# EuroBERT Geopolitical Classifier (Binary)

Fine-tuned `EuroBERT/EuroBERT-210m` for **binary** classification of geopolitical tension in European news text.

- **Task:** Sequence classification (binary)
- **Labels:** `non_geopolitical` (0), `geopolitical` (1)
- **Intended use:** Detecting whether a news article reflects geopolitical tension; performs best on full article-level text
- **Languages:** English, German, French, Spanish, Italian
- **Framework:** 🤗 Transformers (PyTorch)

---

## Quick start

### Inference with `transformers`

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Durrani95/eurobert-geopolitical-binary"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

texts = [
    "Energy Sanctions Deepen Divide Between Western Bloc and Major Oil Exporters.",
    "Military Exercises Near Disputed Waters Raise Fears of Regional Escalations.",
]

# Tokenize as one padded batch; inputs longer than 512 tokens are truncated
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

# Forward pass without gradient tracking, then convert logits to class probabilities
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=1)

# Print the predicted label and its probability for each input text
for text, p in zip(texts, probs):
    label_id = int(p.argmax())
    label = model.config.id2label[label_id]
    confidence = float(p[label_id])
    print(f"{label:>16} {confidence:6.2%} | {text}")
```

---

## Labels

```json
{
  "0": "non_geopolitical",
  "1": "geopolitical"
}
```

You may apply a decision threshold to the `geopolitical` probability (e.g., `score >= 0.5`) and adjust it to suit your precision/recall trade-off: raising the threshold favors precision, lowering it favors recall.
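
As a minimal sketch of such a threshold, assuming you already have the two raw logits for a text (ordered as in the label map above), the helpers below convert them to a `geopolitical` probability and compare it with a cutoff. The function names `geopolitical_score` and `classify` and the example logits are hypothetical and not part of this repository.

```python
import math

def geopolitical_score(logits):
    """Softmax probability of class 1 ("geopolitical") from two raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

def classify(logits, threshold=0.5):
    """Return the label name after applying a cutoff to the class-1 probability."""
    score = geopolitical_score(logits)
    return "geopolitical" if score >= threshold else "non_geopolitical"

# Made-up logits for illustration only: [non_geopolitical, geopolitical]
example_logits = [0.2, 1.1]
print(classify(example_logits))                 # default cutoff of 0.5
print(classify(example_logits, threshold=0.9))  # stricter cutoff
```

With real model output, the same comparison can be applied to the second column of the softmax probabilities instead of taking the `argmax`.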

---

## Training & Evaluation

- **Base model:** `EuroBERT/EuroBERT-210m`
- **Objective:** Cross-entropy (binary)
- **Data:** European news text labeled for geopolitical relevance
- **Hardware:** A100 GPU
- **Epochs:** 1
- **Optimizer:** AdamW with a linear learning-rate scheduler
- **Metrics (validation set):**

| Metric | Score |
|:-------|------:|
| Accuracy | 0.95 |
| F1-score | 0.95 |
| Precision | 0.93 |
| Recall | 0.97 |
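
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two rounded values reported above. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall from the metrics table
f1 = f1_from(0.93, 0.97)
print(round(f1, 2))  # 0.95, matching the reported F1-score
```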

### Training setup

| Parameter | Value |
|:----------|:------|
| Learning rate | 3e-5 |
| Effective batch size (16 × 4) | 64 |
| Per-step GPU batch size | 16 |
| Gradient accumulation | 4 steps |
| Weight decay | 1e-5 |
| AdamW betas | (0.9, 0.95) |
| AdamW epsilon | 1e-8 |
| Max epochs | 1 |
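
To illustrate how these settings fit together, the sketch below derives the effective batch size from the per-step batch size and the accumulation steps, and models a linear learning-rate decay from the peak rate. The dataset size, the absence of warmup, and the decay-to-zero endpoint are assumptions made for illustration; the model card states only that a linear scheduler was used.

```python
import math

PER_STEP_BATCH = 16  # samples per forward/backward pass (from the table)
ACCUM_STEPS = 4      # gradient accumulation steps (from the table)
PEAK_LR = 3e-5       # learning rate (from the table)

effective_batch = PER_STEP_BATCH * ACCUM_STEPS  # 16 * 4 = 64

def optimizer_steps(num_examples, epochs=1):
    """Optimizer updates needed to cover the data, one update per effective batch."""
    return math.ceil(num_examples / effective_batch) * epochs

def linear_lr(step, total_steps, peak_lr=PEAK_LR):
    """Linear decay from peak_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)

# Hypothetical dataset size, not taken from this repository
total = optimizer_steps(num_examples=10_000)
print(effective_batch, total, linear_lr(total // 2, total))
```

Accumulating gradients over 4 passes of 16 samples gives the same number of samples per update as a single batch of 64, while keeping per-pass memory at the smaller batch size.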

---

## Limitations & Risks

- May be sensitive to domain shift (e.g., non-news or social media text)
- Class imbalance can affect thresholding; calibrate the decision threshold on your own validation data
- Multilingual performance can vary across languages and registers
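
One way to calibrate the threshold, sketched here under the assumption that you have `geopolitical` probabilities and gold labels for your own validation set, is to sweep candidate cutoffs and keep the one with the best F1. The function names and the toy scores below are hypothetical, and maximizing F1 is only one possible selection criterion.

```python
def precision_recall_f1(scores, labels, threshold):
    """Precision, recall, F1 for class 1 when predicting 1 iff score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(scores, labels, candidates=None):
    """Return the candidate cutoff with the highest F1 (lowest cutoff wins ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])

# Toy validation scores and gold labels, invented for illustration
val_scores = [0.10, 0.35, 0.40, 0.62, 0.80, 0.95]
val_labels = [0, 0, 1, 0, 1, 1]
print(best_threshold(val_scores, val_labels))
```

Running the same sweep separately per language is one way to see whether a single cutoff suits all five supported languages.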

---

## How to cite

If you use this model, please cite this repository and the EuroBERT base model.