aaron-rae-nicolas committed
Commit 3f63255 · verified · 1 Parent(s): 2d20e0c

Upload 6 files
Setup Instructions for All Techniques/[SETUP] Fine-Tuning (Gemma) - General Models.txt ADDED

Project Dependencies and Setup Instructions

1. Python Environment
This project requires Python 3.10 or higher.

2. Required External Libraries
The following Python libraries are required to run the data processing and fine-tuning scripts. You can install them using pip:

pip install pandas torch transformers peft scikit-learn numpy matplotlib accelerate huggingface_hub

Library Descriptions:
- pandas: Used for data manipulation and loading CSV files.
- torch: PyTorch framework used for deep learning model training.
- transformers: Hugging Face library used to load the Gemma tokenizer and model.
- peft: Parameter-Efficient Fine-Tuning (LoRA) library.
- scikit-learn: Used for calculating metrics (F1, precision, recall) and splitting data.
- numpy: Used for numerical operations and array manipulation.
- matplotlib: Used for generating training loss plots.
- accelerate: Helper library often required by Transformers for model loading.
- huggingface_hub: Required for authenticating with the Hugging Face Hub.

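For orientation, here is a minimal sketch of how these libraries fit together when preparing Gemma for LoRA fine-tuning. The model name comes from this guide; the label count, target modules, and LoRA hyperparameters are illustrative assumptions, not the project's actual configuration:

    # Minimal sketch: load Gemma and wrap it with a LoRA adapter via peft.
    # The hyperparameters and target modules below are illustrative assumptions.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    MODEL_NAME = "google/gemma-3-1b-pt"  # gated; requires Hugging Face authentication

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=4,  # e.g., Product, Delivery, Service, Price
        problem_type="multi_label_classification",
    )

    lora_config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumed attention projections
        task_type="SEQ_CLS",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # sanity check: only LoRA weights are trainable
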
3. Hugging Face Authentication (Gemma Model Access)
The scripts use the Google Gemma model (e.g., 'google/gemma-3-1b-pt'), which is a gated model. To access it, you must follow these steps:

Step A: Grant Access
1. Go to the Hugging Face model page (https://huggingface.co/google/gemma-3-1b-pt).
2. Log in to your Hugging Face account.
3. Review and accept the license terms to gain access.

Step B: Authenticate in the Environment
You must provide a valid Hugging Face Access Token. You can generate one at https://huggingface.co/settings/tokens.

Option 1 (Command Line / Local):
Run the following command in your terminal before starting the script:
huggingface-cli login
(Paste your token when prompted.)

Option 2 (Google Colab / Jupyter Notebook):
If running in a notebook, add a cell at the very top with the following code:

from huggingface_hub import login
login("YOUR_HUGGING_FACE_TOKEN_HERE")

4. Hardware Requirements
The training scripts are configured to use CUDA (NVIDIA GPU). Ensure you have a GPU-enabled environment (e.g., Google Colab with a T4/A100 GPU selected in Runtime settings) and the appropriate CUDA drivers installed.
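Before launching a long training run, it can help to confirm that PyTorch actually sees the GPU. A quick check (not part of the provided scripts):

    # Quick sanity check that CUDA is available before training.
    import torch

    if torch.cuda.is_available():
        print("CUDA device:", torch.cuda.get_device_name(0))
    else:
        print("No CUDA device found; training would fall back to the CPU (very slow).")
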
Setup Instructions for All Techniques/[SETUP] Fine-Tuning (Gemma) - Hierarchical.txt ADDED

PROJECT SETUP GUIDE - RULE-BASED OR LLM (GEMINI) HIERARCHICAL MODEL EVALUATION

The setup steps below are the same for both hierarchical notebooks.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SYSTEM REQUIREMENTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Python 3.8 or higher
16GB RAM minimum (32GB recommended)
NVIDIA GPU with 8GB+ VRAM (required for model evaluation)
Windows, Linux, or macOS
15GB free disk space
Jupyter Notebook or JupyterLab

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 1: INSTALL PYTHON
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Download and install Python from: https://www.python.org/downloads/
During installation:

Check "Add Python to PATH"
Check "Install pip"

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 2: INSTALL JUPYTER NOTEBOOK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Open Command Prompt (Windows) or Terminal (Mac/Linux) and run:
pip install jupyter notebook
Or if you prefer JupyterLab:
pip install jupyterlab

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 3: INSTALL REQUIRED PACKAGES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Copy and paste these commands into Command Prompt, Terminal, or a notebook cell:
pip install pandas numpy scikit-learn transformers peft huggingface-hub

For GPU support (NVIDIA only):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 4: SET UP HUGGING FACE ACCOUNT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The scripts use the Google Gemma model (e.g., 'google/gemma-3-1b-pt'), which is a gated model. To access it, you must follow these steps:

Step A: Grant Access
1. Go to the Hugging Face model page (https://huggingface.co/google/gemma-3-1b-pt).
2. Log in to your Hugging Face account.
3. Review and accept the license terms to gain access.

Step B: Authenticate in the Environment
You must provide a valid Hugging Face Access Token. You can generate one at https://huggingface.co/settings/tokens.

Option 1 (Command Line / Local):
Go to: https://huggingface.co/
Log in or create a free account
Go to Settings > Access Tokens
Create a new token
Install the Hugging Face CLI:
pip install huggingface-hub
Log in with your token:
huggingface-cli login
Paste your token when prompted

Option 2 (Google Colab / Jupyter Notebook):
If running in a notebook, add a cell at the very top with the following code:

from huggingface_hub import login
login("YOUR_HUGGING_FACE_TOKEN_HERE")

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 5: CREATE PROJECT FOLDERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Create this folder structure anywhere on your computer:
your_project/
├── datasets/
│   ├── Boolean23.csv
│   ├── test_product_dataset.csv
│   ├── test_delivery_dataset.csv
│   ├── test_service_dataset.csv
│   ├── test_price_dataset.csv
│   └── test_hierarchy.csv
├── models/
│   ├── gemini/
│   │   ├── gemini_general.pth
│   │   ├── gemma_product_classifier.pth
│   │   ├── gemma_delivery_classifier.pth
│   │   ├── gemma_service_classifier.pth
│   │   └── gemma_price_classifier.pth
│   └── rule-based/
│       ├── rule-based_general.pth
│       ├── gemma_product_classifier.pth
│       ├── gemma_delivery_classifier.pth
│       ├── gemma_service_classifier.pth
│       └── gemma_price_classifier.pth
├── gemini_hierarchical.ipynb
└── rule-based_hierarchical.ipynb

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 6: PREPARE YOUR DATA FILES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
You need 6 CSV files with specific columns:

Boolean23.csv - General aspects test data
Required columns: Review, Product, Delivery, Price, Service

test_product_dataset.csv - Product-specific test data
Required columns: Review, Color_PRO, Condition_PRO, Correctness_PRO,
Durability_PRO, Effectiveness_PRO, Functionality_PRO,
Material_PRO, Sensory_PRO, Size_PRO, General_PRO

test_delivery_dataset.csv - Delivery-specific test data
Required columns: Review, Condition_DEL, Correctness_DEL, Timeliness_DEL,
General_DEL

test_service_dataset.csv - Service-specific test data
Required columns: Review, Handling_SER, Responsiveness_SER,
Trustworthiness_SER, General_SER

test_price_dataset.csv - Price-specific test data
Required columns: Review, Affordability_PRICE, Value_for_Money_PRICE,
General_PRICE

test_hierarchy.csv - Complete hierarchical test data
Required columns: Review + ALL 25 aspect columns from above

Important notes:
Review column: text of the customer feedback
All label columns: 0 or 1 (binary labels)
Column names must match exactly (case-sensitive)

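Because a single misnamed column will break a run, it is worth validating the CSVs up front. A minimal sketch (file and column names are taken from this guide; the helper itself is not part of the notebooks):

    # Quick column check for the test CSVs before running the notebooks.
    import pandas as pd

    REQUIRED = {
        "datasets/Boolean23.csv":
            ["Review", "Product", "Delivery", "Price", "Service"],
        "datasets/test_delivery_dataset.csv":
            ["Review", "Condition_DEL", "Correctness_DEL", "Timeliness_DEL", "General_DEL"],
        # ...add the product, service, price, and hierarchy files the same way.
    }

    for path, columns in REQUIRED.items():
        missing = [c for c in columns if c not in pd.read_csv(path).columns]
        print(path, "OK" if not missing else f"MISSING: {missing}")
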
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 7: PREPARE YOUR TRAINED MODELS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
You need 10 trained model files (.pth format) in the 'models/rule-based/' and 'models/gemini/' folders:

For gemini:
gemini_general.pth - General aspect classifier
gemma_product_classifier.pth - Product-specific classifier
gemma_delivery_classifier.pth - Delivery-specific classifier
gemma_service_classifier.pth - Service-specific classifier
gemma_price_classifier.pth - Price-specific classifier

For rule-based:
rule-based_general.pth - General aspect classifier
gemma_product_classifier.pth - Product-specific classifier
gemma_delivery_classifier.pth - Delivery-specific classifier
gemma_service_classifier.pth - Service-specific classifier
gemma_price_classifier.pth - Price-specific classifier

These should be the trained models from your previous training sessions.
Important: each model file must contain the following keys (a quick way to verify this is sketched below):

model_state_dict: the trained model weights
optimal_thresholds OR optimized_thresholds: decision thresholds for each label

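To confirm a checkpoint has the expected keys before running the full evaluation, something like the following works (an illustrative sketch; depending on your PyTorch version you may need to pass weights_only=False to torch.load):

    # Inspect a saved checkpoint for the keys the notebooks expect.
    import torch

    ckpt = torch.load("models/gemini/gemini_general.pth", map_location="cpu")

    print("keys:", list(ckpt.keys()))
    assert "model_state_dict" in ckpt, "missing trained weights"
    assert "optimal_thresholds" in ckpt or "optimized_thresholds" in ckpt, \
        "missing per-label decision thresholds"
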
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 8: LAUNCH JUPYTER NOTEBOOK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Open Command Prompt or Terminal
Navigate to your project folder:

cd C:\path\to\your_project

Launch Jupyter Notebook:

jupyter notebook
Or if using JupyterLab:
jupyter lab

Your browser will open automatically
Click on '[rule-based/gemini]_hierarchical.ipynb' to open it

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 9: RUN THE NOTEBOOK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Click "Cell" in the top menu
Click "Run All"
Wait for all cells to complete

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT THE CODE DOES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 1: Individual Model Evaluation

Loads each of the 5 trained models one at a time
Evaluates each model on its specific test dataset
Calculates metrics (accuracy, precision, recall, F1-score, etc.)
Saves predictions to separate CSV files
Cleans up memory after each model

PHASE 2: Hierarchical Model Evaluation

Loads the general model and predicts the 4 main aspects (Product, Delivery, Service, Price)
Loads each specific model and predicts the detailed sub-aspects
Applies hierarchical constraints (if a general aspect = 0, all its sub-aspects = 0; see the sketch after this list)
Combines all predictions into complete 25-label predictions
Evaluates the combined hierarchical model's performance
Calculates per-aspect and overall metrics

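The hierarchical constraint in PHASE 2 amounts to a simple masking operation. A minimal sketch with pandas (column names follow this guide; the notebooks' actual implementation may differ):

    # Hierarchical constraint: if a general aspect is predicted 0, force all
    # of its sub-aspect predictions to 0.
    import pandas as pd

    SUB_ASPECTS = {
        "Delivery": ["Condition_DEL", "Correctness_DEL", "Timeliness_DEL", "General_DEL"],
        "Service": ["Handling_SER", "Responsiveness_SER", "Trustworthiness_SER", "General_SER"],
        "Price": ["Affordability_PRICE", "Value_for_Money_PRICE", "General_PRICE"],
        # "Product" maps to its ten *_PRO columns in the same way.
    }

    def apply_hierarchy(preds: pd.DataFrame) -> pd.DataFrame:
        out = preds.copy()
        for general, subs in SUB_ASPECTS.items():
            absent = out[general] == 0   # reviews where the general aspect is absent
            out.loc[absent, subs] = 0    # zero out all of its sub-aspects
        return out
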
PHASE 3: Results and Reports

Displays a comprehensive metrics summary in the notebook
Shows sample predictions with ground truth
Saves detailed results to CSV and text files

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OUTPUT FILES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Individual Model Predictions:
'[rule-based/gemini]_general_test_predictions.csv'
'[rule-based/gemini]_product_test_predictions.csv'
'[rule-based/gemini]_delivery_test_predictions.csv'
'[rule-based/gemini]_service_test_predictions.csv'
'[rule-based/gemini]_price_test_predictions.csv'

Each contains: the original review, predicted labels, probabilities, and an exact match indicator

Hierarchical Model Results:
'[rule-based/gemini]_hierarchical_evaluation_results.csv'
Complete predictions with hierarchical constraints applied
'[rule-based/gemini]_hierarchical_metrics_summary.txt'

Comprehensive metrics report including:
Overall accuracy and F1 scores
Per-aspect metrics
Confusion matrices
Exact match statistics

All files will be saved in your project folder.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
UNDERSTANDING THE OUTPUT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
In Jupyter Notebook, you'll see output directly below each cell.

PHASE 1 Output:
Each model cell shows:
✓ Model loading progress
✓ Inference progress (samples processed)
✓ Metrics summary table
✓ Per-aspect performance breakdown
✓ Memory cleanup confirmation

PHASE 2 Output:
The hierarchical evaluation shows:
✓ Step-by-step progress (7 steps)
✓ General aspect predictions
✓ Specific aspect predictions
✓ Hierarchical constraint application
✓ Per-aspect metrics
✓ Sample predictions (first 23 reviews)
✓ Overall performance summary

Final Summary Cell:
A comprehensive table showing:
✓ Individual model results
✓ Hierarchical model results
✓ General aspects performance
✓ Specific aspects performance
✓ Overall 25-label performance

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
QUICK START CHECKLIST
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
□ Python 3.8+ installed
□ Jupyter Notebook installed
□ All required packages installed via pip
□ GPU with 8GB+ VRAM available
□ Hugging Face account created and logged in
□ Project folders created
□ '[rule-based/gemini]_hierarchical.ipynb' file in the project folder
□ All 6 CSV test files in the 'datasets/' folder
□ All 10 trained model files in the 'models/gemini/' and 'models/rule-based/' folders
□ CSV files have correct column names
□ Ready to launch: jupyter notebook

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
END OF SETUP GUIDE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Setup Instructions for All Techniques/[SETUP] Fine-Tuning (Gemma) - Specifc Models.txt ADDED

PROJECT SETUP GUIDE - SPECIFIC MODELS

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SYSTEM REQUIREMENTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Python 3.8 or higher
16GB RAM minimum (32GB recommended)
NVIDIA GPU with 8GB+ VRAM (recommended for faster training)
Windows, Linux, or macOS
10GB free disk space

You may also use Google Colab with a T4/A100 GPU selected in Runtime settings

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 1: INSTALL PYTHON
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Download and install Python from: https://www.python.org/downloads/
During installation:

Check "Add Python to PATH"
Check "Install pip"

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 2: INSTALL REQUIRED PACKAGES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Open Command Prompt (Windows) or Terminal (Mac/Linux) and copy-paste these commands:

pip install pandas numpy scikit-learn matplotlib transformers peft huggingface-hub

For CPU-only (slower training):
pip install torch torchvision torchaudio

For GPU support (NVIDIA only - faster training):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 3: SET UP HUGGING FACE ACCOUNT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The scripts use the Google Gemma model (e.g., 'google/gemma-3-1b-pt'), which is a gated model. To access it, you must follow these steps:

Step A: Grant Access
1. Go to the Hugging Face model page (https://huggingface.co/google/gemma-3-1b-pt).
2. Log in to your Hugging Face account.
3. Review and accept the license terms to gain access.

Step B: Authenticate in the Environment
You must provide a valid Hugging Face Access Token. You can generate one at https://huggingface.co/settings/tokens.

Option 1 (Command Line / Local):
Go to: https://huggingface.co/
Log in or create a free account
Go to Settings > Access Tokens
Create a new token
Install the Hugging Face CLI:
pip install huggingface-hub
Log in with your token:
huggingface-cli login
Paste your token when prompted

Option 2 (Google Colab / Jupyter Notebook):
If running in a notebook, add a cell at the very top with the following code:

from huggingface_hub import login
login("YOUR_HUGGING_FACE_TOKEN_HERE")

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 4: CREATE PROJECT FOLDERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Create this folder structure anywhere on your computer:
your_project/
├── datasets/
│   ├── [rule-based/gemini]/
│   │   └── [specific aspect]_train_dataset.csv
│   └── test_[specific aspect]_dataset.csv
└── [rule-based/gemini]_[specific aspect]_model.py

The project directory should look like this:
your_project/
├── datasets/
│   ├── rule-based/
│   │   ├── product_train_dataset.csv
│   │   ├── delivery_train_dataset.csv
│   │   ├── price_train_dataset.csv
│   │   └── service_train_dataset.csv
│   ├── gemini/
│   │   ├── product_train_dataset.csv
│   │   ├── delivery_train_dataset.csv
│   │   ├── price_train_dataset.csv
│   │   └── service_train_dataset.csv
│   ├── test_product_dataset.csv
│   ├── test_delivery_dataset.csv
│   ├── test_price_dataset.csv
│   └── test_service_dataset.csv
├── rule-based_product_model.py
├── rule-based_delivery_model.py
├── rule-based_price_model.py
├── rule-based_service_model.py
├── gemini_product_model.py
├── gemini_delivery_model.py
├── gemini_price_model.py
└── gemini_service_model.py

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 5: PREPARE YOUR DATA FILES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Each CSV file must be in its respective directory.

Training set: datasets/[rule-based/gemini]/[specific aspect]_train_dataset.csv

Test set/Ground truth: datasets/test_[specific aspect]_dataset.csv

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 6: UPDATE MODEL SAVE LOCATION (OPTIONAL)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
By default, the model saves to: C:\temp\new_models
If you want to save it somewhere else:

Open the corresponding [rule-based/gemini]_[specific aspect]_model.py in a text editor
Find the SAVE_DIR assignment (around line 416):

SAVE_DIR = r"C:\temp\new_models"

Change it to your preferred location:

Windows example:
SAVE_DIR = r"C:\Users\YourName\Documents\my_models"
Mac/Linux example:
SAVE_DIR = "/home/username/my_models"
Note: The folder will be created automatically if it doesn't exist.

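For reference, the automatic folder creation typically comes down to a single call like this (a sketch of the pattern, not necessarily the scripts' exact code):

    # Create the save directory on demand; a no-op if it already exists.
    import os

    SAVE_DIR = r"C:\temp\new_models"
    os.makedirs(SAVE_DIR, exist_ok=True)
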
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 7: RUN THE CODE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Open Command Prompt or Terminal
Navigate to your project folder:

cd C:\path\to\your_project

Run the script:

python [rule-based/gemini]_[specific aspect]_model.py
For example, for the Gemini-annotated product-specific model: python gemini_product_model.py

Wait for training to complete (1-4 hours depending on hardware)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT THE CODE DOES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Loads the training and test datasets
Splits the training data into 80% train / 20% validation
Trains the respective technique-annotated model for specific-aspect classification
Optimizes the classification thresholds (see the sketch after this list)
Evaluates model performance
Saves the trained model and generates reports

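Threshold optimization here usually means sweeping candidate cutoffs per label and keeping the one with the best validation F1. A minimal sketch with scikit-learn (illustrative; the scripts' actual search may differ):

    # Per-label threshold sweep: pick the cutoff that maximizes validation F1.
    import numpy as np
    from sklearn.metrics import f1_score

    def best_thresholds(y_true, y_prob):
        """y_true, y_prob: (n_samples, n_labels) arrays of labels/probabilities."""
        candidates = np.linspace(0.05, 0.95, 19)
        thresholds = np.empty(y_true.shape[1])
        for j in range(y_true.shape[1]):
            scores = [f1_score(y_true[:, j], (y_prob[:, j] >= t).astype(int),
                               zero_division=0) for t in candidates]
            thresholds[j] = candidates[int(np.argmax(scores))]
        return thresholds
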
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OUTPUT FILES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
gemma_[specific aspect]_specific.pt
Main trained model file

gemma_[specific aspect]_classifier.pth
Model checkpoint with training metadata

training_loss_plot_[specific aspect].png
Training progress visualization

training_loss_per_batch_detailed_[specific aspect].png
Detailed batch-level training curves

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CONSOLE OUTPUT EXPLANATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
While running, you'll see:
✓ Dataset loading confirmation
✓ Class imbalance analysis (positive/negative ratios)
✓ Training progress for each epoch
✓ Validation loss after each epoch
✓ Early stopping notifications
✓ Optimal threshold calculations
✓ Classification reports (precision, recall, F1-score)
✓ Sample predictions vs ground truth

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
QUICK START CHECKLIST
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
□ Python 3.8+ installed
□ All packages installed via pip
□ Hugging Face account created and logged in
□ Project folders created
□ CSV files placed in the correct locations
□ (Optional) Updated the model save directory "SAVE_DIR" in [rule-based/gemini]_[specific aspect]_model.py
□ Ready to run: python [rule-based/gemini]_[specific aspect]_model.py

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
END OF SETUP GUIDE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Setup Instructions for All Techniques/[SETUP] LLM.txt ADDED

# SETUP INSTRUCTIONS FOR LLM MODEL

Option 1: Quick Start (Google Colab)
---------------------------------------------------------
The easiest way to run these notebooks is Google Colab, which requires no local installation.

1. Go to https://colab.research.google.com/
2. Click "File" > "Upload notebook"
3. Upload the .ipynb file you wish to run: gemini_pipeline.ipynb
4. Upload the required data files (CSVs, JSONs) to the Colab "Files" sidebar.
   - the CSVs, JSONs, and PDFs are found in the SOURCE > Data folder
5. Run `!pip install -r requirements.txt` in a cell to install dependencies (in Colab, shell commands are prefixed with !).

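If you prefer to upload the data files from code rather than dragging them into the sidebar, Colab ships a small file-picker helper (optional; step 4 above works just as well):

    # Optional alternative to the "Files" sidebar (only works inside Colab).
    from google.colab import files

    uploaded = files.upload()           # opens a browser file picker
    print("Uploaded:", list(uploaded))  # the files now exist in /content
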
Option 2: Local Installation (Run on your computer)
---------------------------------------------------------
Prerequisites: Python 3.8 or higher

1. Install Jupyter Notebook (if not already installed):
   Open your terminal/command prompt and run:
   pip install notebook

2. Create a Virtual Environment (Recommended):
   python -m venv venv

   # Windows:
   venv\Scripts\activate
   # Mac/Linux:
   source venv/bin/activate

3. Install Project Dependencies:
   Navigate to the SOURCE > Data folder and run:
   pip install -r requirements.txt

4. Start the Application:
   Run the following command to open the interface:
   jupyter notebook
Setup Instructions for All Techniques/[SETUP] Rule-Based.txt ADDED

Rule-Based Keyword Annotator Dependencies and Setup

1. Python Environment
This script requires a standard Python 3 installation.

2. Required External Libraries
The following Python libraries are required. You can install them using pip:

pip install pandas nltk

Library Descriptions:
- pandas: Used for loading the dataset (CSV) and handling data frames.
- nltk: (Natural Language Toolkit) Used for tokenization and accessing standard stopword lists.

Note: Other imported modules (re, csv, collections, string, warnings) are part of the standard Python library and do not need installation.

3. Required Data Files
Ensure the following files are present in the same directory as the notebook before running:

a. Input Dataset: 'SentiTaglish_ProductsAndServices.csv'
The script expects this CSV file to contain the reviews to be processed.

b. Stopwords File: 'stopwords-new.txt'
The script attempts to load a custom list of Filipino stopwords from this file.
Ensure this text file exists in the directory.

4. NLTK Data Downloads
The script includes automated commands to download the necessary NLTK data.
On the first run, ensure you have an internet connection so the script can download:
- 'punkt' (tokenizer models)
- 'stopwords' (standard stopword corpora)

If you are running in an offline environment, you must download these NLTK packages beforehand using `nltk.download()`.
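For orientation, here is a minimal sketch of the kind of preprocessing this setup supports. The file names come from this guide; the assumption that the review text sits in the first column, and the processing itself, are illustrative and may differ from the notebook's actual logic:

    # Minimal preprocessing sketch: load the reviews, merge NLTK's English
    # stopwords with the custom Filipino list, and tokenize one review.
    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")       # tokenizer models
    nltk.download("stopwords")   # standard stopword corpora

    df = pd.read_csv("SentiTaglish_ProductsAndServices.csv")

    with open("stopwords-new.txt", encoding="utf-8") as f:
        custom_stops = {line.strip().lower() for line in f if line.strip()}
    all_stops = set(stopwords.words("english")) | custom_stops

    review = str(df.iloc[0, 0])  # assumes review text is in the first column
    tokens = [t for t in word_tokenize(review.lower())
              if t.isalpha() and t not in all_stops]
    print(tokens)
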
Setup Instructions for All Techniques/[SETUP] Topic Modeling.txt ADDED

Topic Modeling Project Setup (LDA & BERTopic)

1. Python Environment
These scripts require Python 3.8 or higher.

2. Required External Libraries
Install the following libraries to run both the LDA and BERTopic notebooks. You can install them using pip:

pip install pandas gensim nltk pyldavis bertopic plotly scikit-learn

Library Descriptions:
- pandas: Data manipulation and CSV loading.
- gensim: Core library for LDA topic modeling.
- nltk: Natural Language Toolkit for stopword removal and tokenization.
- pyldavis: Interactive visualization for LDA models.
- bertopic: Advanced topic modeling technique that leverages transformers (BERTopic notebook).
- plotly: Visualization library used by BERTopic.
- scikit-learn: Required dependency for BERTopic (and general ML utilities).

3. Required Data Files
Ensure the following files are present in the same directory as the notebooks before running:

a. Input Dataset: 'SentiTaglish_ProductsAndServices.csv'
Both notebooks require this CSV file containing the reviews to be processed.

b. Stopwords File: 'stopwords-new.txt'
The LDA script specifically looks for this file to load custom Tagalog/Filipino stopwords.
Ensure this text file exists in the directory.

4. NLTK Data Downloads
The scripts include automated commands (`nltk.download('stopwords')`) to download the necessary NLTK data.
On the first run, ensure you have an internet connection.

5. Hardware Note (BERTopic)
The BERTopic notebook uses transformer models, which can be computationally intensive. A GPU is recommended for faster processing, though it will run on a standard CPU (just slower).

If running on Google Colab:
- Go to Runtime > Change runtime type > select T4 GPU for better performance.
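To make the moving parts concrete, here is a minimal end-to-end LDA sketch with gensim under the file-naming assumptions above (illustrative only; the notebooks' actual preprocessing, column names, and parameters may differ):

    # Minimal LDA sketch with gensim: tokenize reviews, build a dictionary
    # and bag-of-words corpus, then fit and print a small topic model.
    import pandas as pd
    from gensim import corpora
    from gensim.models import LdaModel

    df = pd.read_csv("SentiTaglish_ProductsAndServices.csv")

    with open("stopwords-new.txt", encoding="utf-8") as f:
        stops = {line.strip().lower() for line in f if line.strip()}

    # Assumes the review text is in the first column.
    texts = [[w for w in str(doc).lower().split() if w.isalpha() and w not in stops]
             for doc in df.iloc[:, 0]]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5,
                   passes=5, random_state=42)
    for topic_id, words in lda.print_topics(num_words=8):
        print(topic_id, words)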