---
tags:
- tf-keras
- bert
- alberto
- multi-task-learning
- text-classification
- italian
- gender-classification
- ideology-detection
library_name: tf-keras
language:
- it
datasets:
- custom
---
# PIDIT: Political Ideology Detection in Italian Texts
A Multi-Task BERT + ALBERTO Model for Gender and Ideology Prediction 🇮🇹
This `tf.keras` model combines two pre-trained encoders — `BERT` and `ALBERTO` — to perform multi-task classification on Italian-language texts.
It is designed to predict:
- **Author gender** (binary classification)
- **Binary ideology** (e.g., progressive vs conservative)
- **Multiclass ideology** (4 ideological classes)
## ✨ Architecture
- `TFBertModel` from `bert-base-italian-uncased` (frozen)
- `TFAutoModel` from `alberto-base-uncased` (frozen)
- Concatenated outputs + dense layers
- Three output heads:
- `gender`: `Dense(1, activation="sigmoid")`
- `ideology_binary`: `Dense(1, activation="sigmoid")`
- `ideology_multiclass`: `Dense(4, activation="softmax")`
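The head structure above can be sketched as a small Keras functional model. This is an illustrative sketch only: the pooled-output width (768) and the intermediate dense width (256) are assumptions, not the model's actual dimensions, and the real model feeds the frozen encoders' outputs into these heads.

```python
import tensorflow as tf

def build_heads(hidden_size=768):
    # Pooled outputs from the two frozen encoders (width is an assumption)
    bert_pooled = tf.keras.Input(shape=(hidden_size,), name="bert_pooled")
    alberto_pooled = tf.keras.Input(shape=(hidden_size,), name="alberto_pooled")

    # Concatenate the encoder representations and pass through a shared dense layer
    x = tf.keras.layers.Concatenate()([bert_pooled, alberto_pooled])
    x = tf.keras.layers.Dense(256, activation="relu")(x)

    # Three task-specific output heads, as listed above
    gender = tf.keras.layers.Dense(1, activation="sigmoid", name="gender")(x)
    ideology_binary = tf.keras.layers.Dense(1, activation="sigmoid", name="ideology_binary")(x)
    ideology_multiclass = tf.keras.layers.Dense(4, activation="softmax", name="ideology_multiclass")(x)

    return tf.keras.Model(
        inputs=[bert_pooled, alberto_pooled],
        outputs=[gender, ideology_binary, ideology_multiclass],
    )
```

Sharing the dense layer across heads is what makes this multi-task: the three objectives regularize a common representation.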
## 📥 Input
The model takes **6 input tensors**:
- `bert_input_ids`, `bert_token_type_ids`, `bert_attention_mask`
- `alberto_input_ids`, `alberto_token_type_ids`, `alberto_attention_mask`
All tensors have shape `(batch_size, max_length)`.
---
## 🚀 Usage
### Load model and tokenizers
```python
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, TFBertModel, TFAutoModel
import tensorflow as tf

# Download the model files locally
model_path = snapshot_download("leeeov4/PIDIT")

# Load the Keras model, registering the custom encoder layers
model = tf.keras.models.load_model(model_path, custom_objects={
    "TFBertModel": TFBertModel,
    "TFAutoModel": TFAutoModel
})

# Load the tokenizers from their subfolders in the repo
bert_tokenizer = AutoTokenizer.from_pretrained("leeeov4/PIDIT", subfolder="bert_tokenizer")
alberto_tokenizer = AutoTokenizer.from_pretrained("leeeov4/PIDIT", subfolder="alberto_tokenizer")
```
### Preprocessing Example
```python
def preprocess_text(text, max_length=250):
    """Tokenize `text` with both tokenizers and build the model's 6-tensor input dict."""
    bert_tokens = bert_tokenizer(text, max_length=max_length, padding='max_length', truncation=True, return_tensors='tf')
    alberto_tokens = alberto_tokenizer(text, max_length=max_length, padding='max_length', truncation=True, return_tensors='tf')
    return {
        'bert_input_ids': bert_tokens['input_ids'],
        'bert_token_type_ids': bert_tokens['token_type_ids'],
        'bert_attention_mask': bert_tokens['attention_mask'],
        'alberto_input_ids': alberto_tokens['input_ids'],
        'alberto_token_type_ids': alberto_tokens['token_type_ids'],
        'alberto_attention_mask': alberto_tokens['attention_mask'],
    }
```
### Inference
```python
text = "Oggi, sabato 31 dicembre, alle ore 9.34, nel Monastero Mater Ecclesiae in Vaticano, il Signore ha chiamato a Sé il Santo Padre Emerito Benedetto XVI."
inputs = preprocess_text(text)
outputs = model.predict(inputs)

# Outputs arrive in head order: gender, ideology_binary, ideology_multiclass
gender_prob = outputs[0][0][0]           # P(male)
ideology_binary_prob = outputs[1][0][0]  # P(left)
ideology_multiclass_probs = outputs[2][0]

print("Predicted gender (male probability):", gender_prob)
print("Predicted binary ideology (left probability):", ideology_binary_prob)
print("Multiclass ideology distribution (left, right, moderate left, moderate right):", ideology_multiclass_probs)
```
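To turn the raw probabilities into labels, you can threshold the binary heads and take the arg-max of the multiclass head. The label order below follows the print statement above, and the 0.5 threshold is an assumption; adjust both if your downstream use calls for it.

```python
import numpy as np

# Hypothetical label order, matching the order printed in the inference example
MULTICLASS_LABELS = ["left", "right", "moderate left", "moderate right"]

def decode_outputs(gender_prob, ideology_binary_prob, multiclass_probs, threshold=0.5):
    """Map the three raw model outputs to human-readable labels."""
    return {
        "gender": "male" if gender_prob >= threshold else "female",
        "ideology_binary": "left" if ideology_binary_prob >= threshold else "right",
        "ideology_multiclass": MULTICLASS_LABELS[int(np.argmax(multiclass_probs))],
    }

# Example with made-up probabilities:
decode_outputs(0.8, 0.2, [0.1, 0.6, 0.2, 0.1])
# → {'gender': 'male', 'ideology_binary': 'right', 'ideology_multiclass': 'right'}
```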