---
tags:
- tf-keras
- bert
- alberto
- multi-task-learning
- text-classification
- italian
- gender-classification
- ideology-detection
library_name: tf-keras
language:
- it
datasets:
- custom
---

# PIDIT: Political Ideology Detection in Italian Texts

A Multi-Task BERT + ALBERTO Model for Gender and Ideology Prediction 🇮🇹

This `tf.keras` model combines two pre-trained encoders, `BERT` and `ALBERTO`, to perform multi-task classification on Italian-language texts. It is designed to predict:

- **Author gender** (binary classification)
- **Binary ideology** (e.g., progressive vs. conservative)
- **Multiclass ideology** (4 ideological classes)

## ✨ Architecture

- `TFBertModel` from `bert-base-italian-uncased` (frozen)
- `TFAutoModel` from `alberto-base-uncased` (frozen)
- Concatenated outputs + dense layers
- Three output heads:
  - `gender`: `Dense(1, activation="sigmoid")`
  - `ideology_binary`: `Dense(1, activation="sigmoid")`
  - `ideology_multiclass`: `Dense(4, activation="softmax")`
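
Assuming each frozen encoder contributes a pooled 768-dimensional vector (the base hidden size of both models), the head structure above can be sketched in `tf.keras` as follows. This is a minimal sketch: the intermediate dense width (256) is illustrative and not necessarily the trained model's exact configuration.

```python
import tensorflow as tf

# Pooled outputs of the two frozen encoders (768 is an assumed base hidden size)
bert_pooled = tf.keras.Input(shape=(768,), name="bert_pooled_output")
alberto_pooled = tf.keras.Input(shape=(768,), name="alberto_pooled_output")

# Concatenate the encoder representations and pass them through a dense block
merged = tf.keras.layers.Concatenate()([bert_pooled, alberto_pooled])
hidden = tf.keras.layers.Dense(256, activation="relu")(merged)

# Three task-specific output heads, mirroring the list above
gender = tf.keras.layers.Dense(1, activation="sigmoid", name="gender")(hidden)
ideology_binary = tf.keras.layers.Dense(1, activation="sigmoid", name="ideology_binary")(hidden)
ideology_multiclass = tf.keras.layers.Dense(4, activation="softmax", name="ideology_multiclass")(hidden)

head_sketch = tf.keras.Model(
    inputs=[bert_pooled, alberto_pooled],
    outputs=[gender, ideology_binary, ideology_multiclass],
)
```

The single shared dense block is one common way to let the tasks regularize each other; the released model may use a different intermediate layout.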
## 📥 Input

The model takes **6 input tensors**:

- `bert_input_ids`, `bert_token_type_ids`, `bert_attention_mask`
- `alberto_input_ids`, `alberto_token_type_ids`, `alberto_attention_mask`

All tensors have shape `(batch_size, max_length)`.
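
As a sanity check on this input signature, a dummy batch with the six required keys can be built with NumPy (shapes only; real values come from the tokenizers shown in the Usage section below):

```python
import numpy as np

batch_size, max_length = 2, 250

# The six input names the model expects, filled with placeholder zeros
input_names = [
    "bert_input_ids", "bert_token_type_ids", "bert_attention_mask",
    "alberto_input_ids", "alberto_token_type_ids", "alberto_attention_mask",
]
dummy_inputs = {name: np.zeros((batch_size, max_length), dtype=np.int32)
                for name in input_names}

# Every tensor has shape (batch_size, max_length)
assert all(t.shape == (batch_size, max_length) for t in dummy_inputs.values())
```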

---

## 🚀 Usage

### Load model and tokenizers

```python
import tensorflow as tf
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, TFAutoModel, TFBertModel

# Download the model repository locally
model_path = snapshot_download("leeeov4/PIDIT")

# Load the model, registering the custom encoder layers
model = tf.keras.models.load_model(model_path, custom_objects={
    "TFBertModel": TFBertModel,
    "TFAutoModel": TFAutoModel,
})

# Load the tokenizers from their subfolders in the repository
bert_tokenizer = AutoTokenizer.from_pretrained("leeeov4/PIDIT", subfolder="bert_tokenizer")
alberto_tokenizer = AutoTokenizer.from_pretrained("leeeov4/PIDIT", subfolder="alberto_tokenizer")
```

### Preprocessing Example

```python
def preprocess_text(text, max_length=250):
    """Tokenize `text` with both tokenizers and return the six input tensors."""
    bert_tokens = bert_tokenizer(
        text, max_length=max_length, padding="max_length",
        truncation=True, return_tensors="tf",
    )
    alberto_tokens = alberto_tokenizer(
        text, max_length=max_length, padding="max_length",
        truncation=True, return_tensors="tf",
    )
    return {
        "bert_input_ids": bert_tokens["input_ids"],
        "bert_token_type_ids": bert_tokens["token_type_ids"],
        "bert_attention_mask": bert_tokens["attention_mask"],
        "alberto_input_ids": alberto_tokens["input_ids"],
        "alberto_token_type_ids": alberto_tokens["token_type_ids"],
        "alberto_attention_mask": alberto_tokens["attention_mask"],
    }
```

### Inference

```python
text = "Oggi, sabato 31 dicembre, alle ore 9.34, nel Monastero Mater Ecclesiae in Vaticano, il Signore ha chiamato a Sé il Santo Padre Emerito Benedetto XVI."
inputs = preprocess_text(text)
outputs = model.predict(inputs)

# The three heads are returned in the model's output order
gender_prob = outputs[0][0][0]
ideology_binary_prob = outputs[1][0][0]
ideology_multiclass_probs = outputs[2][0]

print("Predicted gender (male probability):", gender_prob)
print("Predicted binary ideology (left probability):", ideology_binary_prob)
print("Multiclass ideology distribution (left, right, moderate left, moderate right):", ideology_multiclass_probs)
```
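
To turn the raw probabilities into labels, the two sigmoid heads can be thresholded at 0.5 and the softmax head reduced with an argmax. The label names below follow the orderings stated in the print statements above; the 0.5 threshold is a conventional choice, not something fixed by the model:

```python
def decode_predictions(gender_prob, ideology_binary_prob, multiclass_probs):
    """Map raw head outputs to labels (0.5 threshold assumed for sigmoid heads)."""
    multiclass_labels = ["left", "right", "moderate left", "moderate right"]
    best = max(range(len(multiclass_probs)), key=lambda i: multiclass_probs[i])
    return {
        "gender": "male" if gender_prob >= 0.5 else "female",
        "ideology_binary": "left" if ideology_binary_prob >= 0.5 else "right",
        "ideology_multiclass": multiclass_labels[best],
    }

print(decode_predictions(0.83, 0.21, [0.05, 0.70, 0.15, 0.10]))
# prints {'gender': 'male', 'ideology_binary': 'right', 'ideology_multiclass': 'right'}
```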