leeeov4 committed on
Commit 19fdbb7 · verified · 1 Parent(s): 031a339

Update README.md

Files changed (1): README.md (+63 -17)
datasets:
  - custom
---
# PIDIT: Multi-Task BERT + ALBERTO Model for Gender and Ideology Prediction 🇮🇹

This `tf.keras` model combines two pre-trained encoders (`BERT` and `ALBERTO`) to perform multi-task classification on Italian-language texts. It is designed to predict:

- 🧑‍🤝‍🧑 **Author gender** (binary classification)
- 🏛️ **Binary ideology** (e.g., progressive vs. conservative)
- 🧭 **Multiclass ideology** (4 ideological classes)
 
## ✨ Architecture

- `TFBertModel` from `bert-base-italian-uncased` (frozen)
- `TFAutoModel` from `alberto-base-uncased` (frozen)
- Concatenated encoder outputs feeding shared dense layers
- Three output heads:
  - `gender`: `Dense(1, activation="sigmoid")`
  - `ideology_binary`: `Dense(1, activation="sigmoid")`
  - `ideology_multiclass`: `Dense(4, activation="softmax")`
 
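The wiring described above can be sketched as a `tf.keras` functional graph. This is a minimal, self-contained sketch, not the released model: the frozen `TFBertModel`/`TFAutoModel` backbones are replaced by tiny embedding stand-ins so the code runs without downloading weights, `build_pidit_skeleton` is a hypothetical helper name, and the trunk width (256) is a guess.

```python
import tensorflow as tf

def build_pidit_skeleton(max_length=250, hidden=128, vocab=32000):
    # One (input_ids, token_type_ids, attention_mask) triple per encoder,
    # named to match the six inputs listed in this card.
    def triple(prefix):
        return [tf.keras.Input(shape=(max_length,), dtype="int32", name=f"{prefix}_{n}")
                for n in ("input_ids", "token_type_ids", "attention_mask")]

    bert_in, alberto_in = triple("bert"), triple("alberto")

    def toy_encoder(ids, token_type_ids, attention_mask, name):
        # Stand-in for a frozen Transformer encoder: embed the three id tensors,
        # sum them, and mean-pool into a fixed-size "pooled" vector.
        tok = tf.keras.layers.Embedding(vocab, hidden, name=f"{name}_tok")(ids)
        seg = tf.keras.layers.Embedding(2, hidden, name=f"{name}_seg")(token_type_ids)
        msk = tf.keras.layers.Embedding(2, hidden, name=f"{name}_msk")(attention_mask)
        summed = tf.keras.layers.Add(name=f"{name}_sum")([tok, seg, msk])
        return tf.keras.layers.GlobalAveragePooling1D(name=f"{name}_pool")(summed)

    # Concatenate the two pooled vectors and pass them through a shared trunk
    pooled = tf.keras.layers.Concatenate()([
        toy_encoder(*bert_in, "bert"), toy_encoder(*alberto_in, "alberto")])
    trunk = tf.keras.layers.Dense(256, activation="relu")(pooled)  # width is a guess

    # Three task-specific heads, as listed above
    heads = [
        tf.keras.layers.Dense(1, activation="sigmoid", name="gender")(trunk),
        tf.keras.layers.Dense(1, activation="sigmoid", name="ideology_binary")(trunk),
        tf.keras.layers.Dense(4, activation="softmax", name="ideology_multiclass")(trunk),
    ]
    return tf.keras.Model(inputs=bert_in + alberto_in, outputs=heads)
```

Because the inputs are named, the model can be called with the same six-key dictionary the preprocessing step below produces.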
## 📥 Input

The model takes **6 input tensors**:

- `bert_input_ids`, `bert_token_type_ids`, `bert_attention_mask`
- `alberto_input_ids`, `alberto_token_type_ids`, `alberto_attention_mask`

All tensors have shape `(batch_size, max_length)`.

---
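As a quick sanity check, the six expected tensors can be mocked up with NumPy before the real tokenizers are involved (names and shapes as listed above; `batch_size` here is arbitrary and `dummy_inputs` is just an illustrative variable):

```python
import numpy as np

batch_size, max_length = 2, 250
names = [
    "bert_input_ids", "bert_token_type_ids", "bert_attention_mask",
    "alberto_input_ids", "alberto_token_type_ids", "alberto_attention_mask",
]

# Dummy int32 batch with the expected (batch_size, max_length) shape;
# real values come from the two tokenizers loaded in the usage section.
dummy_inputs = {name: np.zeros((batch_size, max_length), dtype="int32") for name in names}

assert all(t.shape == (batch_size, max_length) for t in dummy_inputs.values())
```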
## 🚀 Usage

### 1. Load the model

```python
from huggingface_hub import snapshot_download
from transformers import TFBertModel, TFAutoModel
import tensorflow as tf

# Download the model files locally
model_path = snapshot_download("leeeov4/PIDIT")

# Load the Keras model, registering the two encoder classes used as custom layers
model = tf.keras.models.load_model(model_path, custom_objects={
    "TFBertModel": TFBertModel,
    "TFAutoModel": TFAutoModel
})
```
### 2. Load the tokenizers

```python
from transformers import AutoTokenizer

# The tokenizers are stored in subfolders of the model repository
bert_tokenizer = AutoTokenizer.from_pretrained("leeeov4/PIDIT", subfolder="bert_tokenizer")
alberto_tokenizer = AutoTokenizer.from_pretrained("leeeov4/PIDIT", subfolder="alberto_tokenizer")
```
## 🧼 Preprocessing Example

```python
def preprocess_text(text, max_length=250):
    # Tokenize the same text with both tokenizers, padded/truncated to max_length
    bert_tokens = bert_tokenizer(text, max_length=max_length, padding='max_length',
                                 truncation=True, return_tensors='tf')
    alberto_tokens = alberto_tokenizer(text, max_length=max_length, padding='max_length',
                                       truncation=True, return_tensors='tf')

    return {
        'bert_input_ids': bert_tokens['input_ids'],
        'bert_token_type_ids': bert_tokens['token_type_ids'],
        'bert_attention_mask': bert_tokens['attention_mask'],
        'alberto_input_ids': alberto_tokens['input_ids'],
        'alberto_token_type_ids': alberto_tokens['token_type_ids'],
        'alberto_attention_mask': alberto_tokens['attention_mask'],
    }
```
## 🧼 Inference

```python
text = "Questo è un esempio di testo italiano per testare il modello."
inputs = preprocess_text(text)
outputs = model.predict(inputs)

# Each head returns a batch of predictions; take the first (and only) example
gender_prob = outputs[0][0][0]
ideology_binary_prob = outputs[1][0][0]
ideology_multiclass_probs = outputs[2][0]

print("Predicted gender (male probability):", gender_prob)
print("Predicted binary ideology (conservative probability):", ideology_binary_prob)
print("Multiclass ideology distribution:", ideology_multiclass_probs)
```
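The raw head outputs can be mapped to labels with a simple threshold/argmax step. A sketch, assuming the conventions implied by the print statements above (sigmoid ≥ 0.5 meaning "male" / "conservative"); the four multiclass label names are not documented in this card, so placeholders are used, and `decode_outputs` is a hypothetical helper.

```python
import numpy as np

def decode_outputs(gender_prob, ideology_binary_prob, multiclass_probs, threshold=0.5):
    # Placeholder class names: the card does not document the 4-class mapping
    multiclass_labels = ["class_0", "class_1", "class_2", "class_3"]
    return {
        "gender": "male" if gender_prob >= threshold else "female",
        "ideology_binary": "conservative" if ideology_binary_prob >= threshold else "progressive",
        "ideology_multiclass": multiclass_labels[int(np.argmax(multiclass_probs))],
    }

print(decode_outputs(0.8, 0.2, np.array([0.1, 0.6, 0.2, 0.1])))
# -> {'gender': 'male', 'ideology_binary': 'progressive', 'ideology_multiclass': 'class_1'}
```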