---
library_name: transformers
pipeline_tag: summarization
---
# Populism Detection & Summarization

This checkpoint is a BART-based, LoRA fine-tuned model that does two things:

1. **Summarizes** party press releases (and, when relevant, explains where populist framing appears), and
2. **Classifies** whether the text contains populist language (`Is_Populist` ∈ {0, 1}).

The weights here are the merged LoRA result; no adapters are required.

The model was trained on ~10k official party press releases from 12 countries (Italy, Sweden, Switzerland, Netherlands, Germany, Denmark, Spain, UK, Austria, Poland, Ireland, France) that were labeled and summarized via a Palantir AIP Ontology step using GPT-4o.

## Model Details

- **Pretrained model:** `facebook/bart-base` (seq2seq), fine-tuned with LoRA and then merged.
- **Instruction framing:** two prefixes:
  - Summarize: `summarize: <original_text>`
  - Classify: `classify_populism: <original_text>` → the model outputs `0` or `1` (or you can argmax over the first decoder-step logits for the tokens `"0"` vs. `"1"`).
- **Tokenization:** BART's subword tokenizer (byte-pair encoding).
- **Input processing:** text is truncated to 1024 tokens; summaries are capped at 128 tokens.
- **Output generation (summarization):** beam search (typically 5 beams), a mild length penalty, and no-repeat trigrams (`no_repeat_ngram_size=3`) to reduce redundancy.

**Key parameters:**

- Max input length: 1024 tokens (fits long releases while controlling memory).
- Max target length: 128 tokens (concise summaries with good coverage).
- Beam search: ~5 beams (balances quality and speed).
- Classification decoding: read the first generated token (`0`/`1`) or take the first-step logits for a deterministic argmax.

**Generation process (high level):**

1. Input tokenization: convert the text to subwords and build the encoder input.
2. Beam search (summarize): explore multiple candidate sequences and pick the most probable one.
3. Output decoding: map token IDs back to text, skipping special tokens.

**Model Hub:** `tdickson17/Populism_detection`

**Repository:** https://github.com/tcdickson/Populism.git

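The deterministic argmax route renormalizes the two first-step logits with a softmax restricted to the `"0"` and `"1"` tokens; a model-free sketch in plain Python (the logit values below are invented for illustration):

```python
import math

def two_way_softmax(logit0: float, logit1: float) -> float:
    """P(label = 1) when renormalizing over just the "0" and "1" token logits."""
    m = max(logit0, logit1)  # subtract the max for numerical stability
    e0, e1 = math.exp(logit0 - m), math.exp(logit1 - m)
    return e1 / (e0 + e1)

p1 = two_way_softmax(1.2, 3.4)   # invented first-step logits for "0" and "1"
print(p1 >= 0.5, round(p1, 3))   # thresholding at 0.5 mirrors the argmax
```

Because only two logits enter the softmax, thresholding `p1` at 0.5 is exactly the argmax over the two label tokens.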
## Training Details

**Data collection:**
Press releases were scraped from official party websites to capture formal statements and policy messaging. A Palantir AIP Ontology step (powered by GPT-4o) produced:

- `Is_Populist` (binary): whether the text exhibits populist framing (e.g., "people vs. elites," anti-institutional rhetoric).
- Summaries/explanations: concise abstracts; when populism is present, the summary explains where and how it appears.

**Preprocessing:**
HTML/boilerplate removal, normalization, and formatting into pairs:

- Input: original release text (title optional at inference)
- Targets: (a) abstract summary/explanation, (b) binary label

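The exact cleaning pipeline is not published here; a minimal sketch of this kind of pair formatting, using a naive regex-based tag strip (all function names are illustrative):

```python
import html
import re
import unicodedata

def clean(raw: str) -> str:
    """Naive boilerplate removal: decode entities, drop tags, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html.unescape(raw))  # decode entities, strip tags
    text = unicodedata.normalize("NFKC", text)          # e.g. non-breaking space -> space
    return re.sub(r"\s+", " ", text).strip()

def make_pairs(release_html: str, summary: str, is_populist: int):
    """Format one cleaned release into the two (prefix + input, target) training pairs."""
    body = clean(release_html)
    return [
        ("summarize: " + body, summary),
        ("classify_populism: " + body, str(is_populist)),
    ]

pairs = make_pairs("<p>The elites have betrayed&nbsp;the people.</p>",
                   "Accuses elites of betraying the people.", 1)
print(pairs[1][0])  # -> classify_populism: The elites have betrayed the people.
```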
**Training objective:**
Supervised fine-tuning for two joint tasks:

- Abstractive summarization (seq2seq cross-entropy)
- Binary classification (decoded `0`/`1` via the same seq2seq head)

**Training strategy:**

- Base: `facebook/bart-base`
- Method: LoRA on attention/FFN blocks (r=16, α=32, dropout=0.05), then merged into the base model.
- Decoding: beam search for summaries; argmax or short generation for labels.
- Evaluation signals: ROUGE for summaries; accuracy, precision, recall, and F1 for classification.

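The merge step folds the scaled low-rank update into each adapted weight matrix (W' = W + (α/r)·B·A), which is why no adapter is needed at inference. A toy numpy sketch (the dimensions are made up; only `r` and `alpha` match the card):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 32          # toy sizes; the run described above used r=16, alpha=32

W = rng.normal(size=(d, d))     # frozen base weight
A = rng.normal(size=(r, d))     # LoRA down-projection
B = rng.normal(size=(d, r))     # LoRA up-projection

# Fold the scaled low-rank update into the base weight.
W_merged = W + (alpha / r) * (B @ A)

# The update itself is rank-r by construction.
print(np.linalg.matrix_rank(W_merged - W))  # -> 2
```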
This setup lets one checkpoint handle both analysis (the populism flag) and explanation (the summary) with simple instruction prefixes.

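The classification-side evaluation signals reduce to counts over a confusion table; a self-contained sketch (the toy labels are made up):

```python
def prf1(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for a binary populism label."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

# Toy gold labels vs. model predictions:
print(prf1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```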
## Usage

Install the dependencies:

```bash
pip install torch transformers
```

Then run:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "tdickson17/Populism_detection"
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).to(device).eval()

MAX_SRC, MAX_SUM = 1024, 128
DEC_START = model.config.decoder_start_token_id
ID0 = tok("0", add_special_tokens=False)["input_ids"][0]
ID1 = tok("1", add_special_tokens=False)["input_ids"][0]

THRESHOLD = 0.5  # raise for higher precision, lower for higher recall
POSITIVE_MSG = "This text DOES contain populist sentiment.\n"
NEGATIVE_MSG = "Populist sentiment is NOT detected in this text.\n"

GEN_SUM = dict(
    do_sample=False, num_beams=5,
    max_new_tokens=MAX_SUM, min_new_tokens=16,
    length_penalty=1.1, no_repeat_ngram_size=3,
)

@torch.no_grad()
def summarize(text: str) -> str:
    enc = tok("summarize: " + text, return_tensors="pt",
              truncation=True, max_length=MAX_SRC).to(device)
    out = model.generate(**enc, **GEN_SUM)
    s = tok.decode(out[0], skip_special_tokens=True).strip()
    # Drop the instruction prefix if the model echoes it.
    if s.lower().startswith("summarize:"):
        s = s.split(":", 1)[1].strip()
    return s

@torch.no_grad()
def classify_populism_prob(text: str) -> float:
    enc = tok("classify_populism: " + text, return_tensors="pt",
              truncation=True, max_length=MAX_SRC).to(device)
    dec_inp = torch.tensor([[DEC_START]], device=device)
    logits = model(**enc, decoder_input_ids=dec_inp, use_cache=False).logits[:, -1, :]
    # Renormalize over just the "0" and "1" token logits.
    two = torch.stack([logits[:, ID0], logits[:, ID1]], dim=-1)
    p1 = torch.softmax(two, dim=-1)[0, 1].item()
    return p1

def classify_populism_label(text: str, threshold: float = THRESHOLD, include_probability: bool = True) -> str:
    p1 = classify_populism_prob(text)
    msg = POSITIVE_MSG if p1 >= threshold else NEGATIVE_MSG
    return f"{msg}Confidence={p1:.3f}" if include_probability else msg

# Example
text = """<Insert Text here>"""
print(classify_populism_label(text))
print("\nSummary:\n", summarize(text))
```

## Citation

```bibtex
@article{dickson2024going,
  title={Going against the grain: Climate change as a wedge issue for the radical right},
  author={Dickson, Zachary P and Hobolt, Sara B},
  journal={Comparative Political Studies},
  year={2024},
  publisher={SAGE Publications Sage CA: Los Angeles, CA}
}
```