Pujan-Dev committed on
Commit 5298fcc · 1 Parent(s): 33fb2d7

Added documentation

notebook/ai_vs_human/main.ipynb ADDED
@@ -0,0 +1,1110 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "e522047b",
6
+ "metadata": {},
7
+ "source": [
8
+ "# AI vs Human Text Detector using BERT\n",
9
+ "Fine-tunes google-bert/bert-base-cased with LoRA on the HC3 dataset and/or local data (~20k samples)."
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "code",
14
+ "execution_count": 35,
15
+ "id": "16eddd36",
16
+ "metadata": {},
17
+ "outputs": [],
18
+ "source": [
19
+ "from functools import partial\n",
20
+ "\n",
21
+ "import datasets\n",
22
+ "from datasets import Dataset, DatasetDict, concatenate_datasets\n",
23
+ "import evaluate\n",
24
+ "import numpy as np\n",
25
+ "import torch\n",
26
+ "from transformers import (\n",
27
+ " AutoModelForSequenceClassification,\n",
28
+ " AutoTokenizer,\n",
29
+ " PreTrainedTokenizer,\n",
30
+ " BatchEncoding,\n",
31
+ " DataCollatorWithPadding,\n",
32
+ " Trainer,\n",
33
+ " TrainingArguments,\n",
34
+ ")\n",
35
+ "from peft import LoraConfig, get_peft_model"
36
+ ]
37
+ },
38
+ {
39
+ "cell_type": "markdown",
40
+ "id": "99bca750",
41
+ "metadata": {},
42
+ "source": [
43
+ "## Load AI Detection Dataset (~20k samples)"
44
+ ]
45
+ },
46
+ {
47
+ "cell_type": "code",
48
+ "execution_count": 36,
49
+ "id": "2945f87a",
50
+ "metadata": {},
51
+ "outputs": [],
52
+ "source": [
+ "def get_raid_dataset(max_samples: int = 20000, use_local: bool = True) -> DatasetDict:\n",
+ "    \"\"\"Load HC3 and/or local human-vs-AI text (not RAID, despite the name), cap it at max_samples, and return a 95/5 train/test split.\"\"\"\n",
+ "\n",
+ "    print(\"Loading AI vs Human text dataset...\")\n",
+ "\n",
+ "    all_texts = []\n",
+ "    all_labels = []\n",
+ "\n",
+ "    # Try loading the HC3 dataset (Human ChatGPT Comparison Corpus)\n",
+ "    try:\n",
+ "        print(\"Attempting to load HC3 dataset...\")\n",
+ "        dataset = datasets.load_dataset(\"Hello-SimpleAI/HC3\", \"all\", split=\"train\")\n",
+ "\n",
+ "        # HC3 rows have 'question', 'human_answers', and 'chatgpt_answers'\n",
+ "        for item in dataset:\n",
+ "            # Add human answers\n",
+ "            if 'human_answers' in item and item['human_answers']:\n",
+ "                for answer in item['human_answers'][:1]:  # Take first answer\n",
+ "                    if answer and len(answer.strip()) > 0:\n",
+ "                        all_texts.append(answer)\n",
+ "                        all_labels.append(0)  # 0 for human\n",
+ "\n",
+ "            # Add AI answers\n",
+ "            if 'chatgpt_answers' in item and item['chatgpt_answers']:\n",
+ "                for answer in item['chatgpt_answers'][:1]:  # Take first answer\n",
+ "                    if answer and len(answer.strip()) > 0:\n",
+ "                        all_texts.append(answer)\n",
+ "                        all_labels.append(1)  # 1 for AI\n",
+ "\n",
+ "        print(f\"✓ Loaded {len(all_texts)} samples from HC3 dataset\")\n",
+ "    except Exception as e:\n",
+ "        # Keep going: the local dataset may still provide data\n",
+ "        print(f\"⚠ Could not load HC3 dataset: {e}\")\n",
+ "\n",
+ "    # Load local data and combine\n",
+ "    if use_local:\n",
+ "        try:\n",
+ "            print(\"Loading local dataset...\")\n",
+ "            import pandas as pd\n",
+ "            df = pd.read_json(\"./DATASET/basic_Data.jsonl\", lines=True)\n",
+ "\n",
+ "            # Build a proper binary classification dataset: human_text -> 0, ai_text -> 1\n",
+ "            if {\"human_text\", \"ai_text\"}.issubset(df.columns):\n",
+ "                human_texts = list(df[\"human_text\"].dropna())\n",
+ "                ai_texts = list(df[\"ai_text\"].dropna())\n",
+ "                local_texts = human_texts + ai_texts\n",
+ "                local_labels = [0] * len(human_texts) + [1] * len(ai_texts)\n",
+ "\n",
+ "                all_texts.extend(local_texts)\n",
+ "                all_labels.extend(local_labels)\n",
+ "\n",
+ "                print(f\"✓ Loaded {len(local_texts)} samples from local data\")\n",
+ "            else:\n",
+ "                print(\"⚠ Local dataset doesn't have required columns\")\n",
+ "        except Exception as e:\n",
+ "            print(f\"⚠ Could not load local dataset: {e}\")\n",
+ "\n",
+ "    # Check if we have any data\n",
+ "    if len(all_texts) == 0:\n",
+ "        raise ValueError(\"No data loaded! Check HC3 dataset or local data availability\")\n",
+ "\n",
+ "    # Create combined dataset\n",
+ "    combined_dataset = Dataset.from_dict({\n",
+ "        \"text\": all_texts,\n",
+ "        \"label\": all_labels\n",
+ "    })\n",
+ "\n",
+ "    print(f\"Total combined samples: {len(combined_dataset)}\")\n",
+ "\n",
+ "    # Shuffle and limit to max_samples\n",
+ "    combined_dataset = combined_dataset.shuffle(seed=42)\n",
+ "    if len(combined_dataset) > max_samples:\n",
+ "        combined_dataset = combined_dataset.select(range(max_samples))\n",
+ "        print(f\"Limited to {max_samples} samples\")\n",
+ "\n",
+ "    # Filter out empty texts\n",
+ "    combined_dataset = combined_dataset.filter(lambda x: x['text'] is not None and len(x['text'].strip()) > 0)\n",
+ "\n",
+ "    # Split into train/test (95/5 split)\n",
+ "    dataset_split = combined_dataset.train_test_split(test_size=0.05, seed=42)\n",
+ "\n",
+ "    print(\"\\n✓ Dataset ready!\")\n",
+ "    print(f\" Train samples: {len(dataset_split['train'])}\")\n",
+ "    print(f\" Test samples: {len(dataset_split['test'])}\")\n",
+ "\n",
+ "    # Check label distribution (np is imported at the top of the notebook)\n",
+ "    train_labels = np.array(dataset_split['train']['label'])\n",
+ "    print(\" Label distribution (train):\")\n",
+ "    print(f\" Human (0): {(train_labels == 0).sum()}\")\n",
+ "    print(f\" AI (1): {(train_labels == 1).sum()}\")\n",
+ "\n",
+ "    return dataset_split"
143
+ ]
144
+ },
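The labeling scheme used by the loader above (human text -> 0, AI text -> 1) can be sketched in plain Python. This is only an illustration: the helper name `build_labeled_examples` and the sample records are hypothetical, and only the `human_text` and `ai_text` keys come from the notebook. Unlike the notebook, which drops whitespace-only rows in a later filter step, this sketch skips them while building the lists.

```python
def build_labeled_examples(records: list[dict]) -> tuple[list[str], list[int]]:
    """Flatten paired records into parallel text/label lists (hypothetical helper)."""
    texts: list[str] = []
    labels: list[int] = []
    # All human texts first, then all AI texts, mirroring the notebook's ordering
    for key, label in (("human_text", 0), ("ai_text", 1)):
        for record in records:
            value = record.get(key)
            # Skip missing or whitespace-only entries
            if isinstance(value, str) and value.strip():
                texts.append(value)
                labels.append(label)
    return texts, labels


# Made-up records for illustration only
sample_records = [
    {"human_text": "I walked to the shop.", "ai_text": "The shop was visited on foot."},
    {"human_text": None, "ai_text": "Certainly! Here is a summary."},
    {"human_text": "brb, tea", "ai_text": "   "},
]
texts, labels = build_labeled_examples(sample_records)
print(labels)
```

Because the lists are built class by class, shuffling before the train/test split (as the notebook does with `seed=42`) matters: without it, a split taken from the tail would contain only one class.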
145
+ {
146
+ "cell_type": "code",
147
+ "execution_count": 37,
148
+ "id": "38d8478c",
149
+ "metadata": {},
150
+ "outputs": [
151
+ {
152
+ "name": "stdout",
153
+ "output_type": "stream",
154
+ "text": [
155
+ "Loading AI vs Human text dataset...\n",
156
+ "Attempting to load HC3 dataset...\n",
157
+ "⚠ Could not load HC3 dataset: Dataset scripts are no longer supported, but found HC3.py\n",
158
+ "Loading local dataset...\n",
159
+ "✓ Loaded 19940 samples from local data\n",
160
+ "Total combined samples: 19940\n"
161
+ ]
162
+ },
163
+ {
164
+ "name": "stderr",
165
+ "output_type": "stream",
166
+ "text": [
167
+ "Filter: 100%|██████████| 19940/19940 [00:00<00:00, 95584.60 examples/s] \n"
168
+ ]
169
+ },
170
+ {
171
+ "name": "stdout",
172
+ "output_type": "stream",
173
+ "text": [
174
+ "\n",
175
+ "✓ Dataset ready!\n",
176
+ " Train samples: 18943\n",
177
+ " Test samples: 997\n",
178
+ " Label distribution (train):\n",
179
+ " Human (0): 9477\n",
180
+ " AI (1): 9466\n"
181
+ ]
182
+ }
183
+ ],
184
+ "source": [
185
+ "# Load dataset\n",
186
+ "raw_datasets = get_raid_dataset(max_samples=20000)"
187
+ ]
188
+ },
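The sizes printed above can be checked with a little arithmetic. The sketch below assumes a simple integer 5% hold-out; it does not reproduce the exact rounding rule inside `datasets`, and `split_sizes` is a hypothetical helper. The sample counts are the ones recorded in the cell output.

```python
def split_sizes(total: int, test_percent: int) -> tuple[int, int]:
    """Return (train, test) sizes for an integer-percentage hold-out."""
    test = total * test_percent // 100
    return total - test, test


train_n, test_n = split_sizes(19940, 5)
print(train_n, test_n)

# Class balance in the training split, using the counts printed by the notebook
human_n, ai_n = 9477, 9466
human_share = human_n / (human_n + ai_n)
print(f"human share of train split: {human_share:.4f}")
```

With roughly half of the training rows in each class, accuracy is a reasonable headline metric here and a trivial majority-class baseline would score about 50%.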
189
+ {
190
+ "cell_type": "markdown",
191
+ "id": "f60191f6",
192
+ "metadata": {},
193
+ "source": [
194
+ "## Initialize Model and Tokenizer"
195
+ ]
196
+ },
197
+ {
198
+ "cell_type": "code",
199
+ "execution_count": 38,
200
+ "id": "315bb737",
201
+ "metadata": {},
202
+ "outputs": [
203
+ {
204
+ "name": "stderr",
205
+ "output_type": "stream",
206
+ "text": [
207
+ "Loading weights: 100%|██████████| 199/199 [00:00<00:00, 1208.24it/s, Materializing param=bert.pooler.dense.weight] \n",
208
+ "BertForSequenceClassification LOAD REPORT from: google-bert/bert-base-cased\n",
209
+ "Key | Status | \n",
210
+ "-------------------------------------------+------------+-\n",
211
+ "cls.predictions.transform.LayerNorm.bias | UNEXPECTED | \n",
212
+ "cls.seq_relationship.weight | UNEXPECTED | \n",
213
+ "cls.predictions.transform.dense.weight | UNEXPECTED | \n",
214
+ "cls.seq_relationship.bias | UNEXPECTED | \n",
215
+ "cls.predictions.bias | UNEXPECTED | \n",
216
+ "cls.predictions.transform.dense.bias | UNEXPECTED | \n",
217
+ "cls.predictions.transform.LayerNorm.weight | UNEXPECTED | \n",
218
+ "classifier.weight | MISSING | \n",
219
+ "classifier.bias | MISSING | \n",
220
+ "\n",
221
+ "Notes:\n",
222
+ "- UNEXPECTED\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\n",
223
+ "- MISSING\t:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.\n"
224
+ ]
225
+ },
226
+ {
227
+ "name": "stdout",
228
+ "output_type": "stream",
229
+ "text": [
230
+ "Model loaded: google-bert/bert-base-cased\n",
231
+ "Device: cuda\n"
232
+ ]
233
+ }
234
+ ],
235
+ "source": [
236
+ "# Use google-bert/bert-base-cased\n",
237
+ "base_model_name = \"google-bert/bert-base-cased\"\n",
238
+ "\n",
239
+ "tokenizer = AutoTokenizer.from_pretrained(base_model_name)\n",
240
+ "model = AutoModelForSequenceClassification.from_pretrained(\n",
241
+ " base_model_name,\n",
242
+ " num_labels=2,\n",
243
+ ").to(device='cuda' if torch.cuda.is_available() else 'cpu')\n",
244
+ "\n",
245
+ "print(f\"Model loaded: {base_model_name}\")\n",
246
+ "print(f\"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}\")"
247
+ ]
248
+ },
249
+ {
250
+ "cell_type": "markdown",
251
+ "id": "0a772192",
252
+ "metadata": {},
253
+ "source": [
254
+ "## Apply LoRA for Parameter-Efficient Fine-tuning"
255
+ ]
256
+ },
257
+ {
258
+ "cell_type": "code",
259
+ "execution_count": 39,
260
+ "id": "ba294e50",
261
+ "metadata": {},
262
+ "outputs": [
263
+ {
264
+ "name": "stdout",
265
+ "output_type": "stream",
266
+ "text": [
267
+ "trainable params: 2,680,322 || all params: 110,992,132 || trainable%: 2.4149\n"
268
+ ]
269
+ }
270
+ ],
271
+ "source": [
272
+ "peft_config = LoraConfig(\n",
273
+ " r=16,\n",
274
+ " target_modules=\"all-linear\",\n",
275
+ " lora_alpha=16,\n",
276
+ " bias=\"none\",\n",
277
+ " lora_dropout=0.05,\n",
278
+ " use_rslora=True,\n",
279
+ " modules_to_save=[\"classifier\"],\n",
280
+ ")\n",
281
+ "\n",
282
+ "model = get_peft_model(model, peft_config)\n",
283
+ "model.print_trainable_parameters()"
284
+ ]
285
+ },
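The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes that `target_modules="all-linear"` attaches a rank-16 adapter to the six linear layers in each of the 12 encoder blocks plus the pooler, and that the classifier head is trained in full through `modules_to_save`. That layer breakdown is an assumption about this model, not something the notebook states, but it matches the printed figure.

```python
import math

HIDDEN, FFN, LAYERS, RANK, ALPHA, NUM_LABELS = 768, 3072, 12, 16, 16, 2


def lora_params(d_in: int, d_out: int, r: int = RANK) -> int:
    """A LoRA adapter adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)


per_layer = (
    4 * lora_params(HIDDEN, HIDDEN)  # query, key, value, attention output
    + lora_params(HIDDEN, FFN)       # feed-forward up-projection
    + lora_params(FFN, HIDDEN)       # feed-forward down-projection
)
adapter_total = LAYERS * per_layer + lora_params(HIDDEN, HIDDEN)  # plus pooler dense
classifier = HIDDEN * NUM_LABELS + NUM_LABELS  # weights + biases, trained in full
trainable = adapter_total + classifier
percent = 100 * trainable / 110_992_132  # total taken from the notebook output

print(trainable, round(percent, 4))

# Adapter scaling: standard LoRA uses alpha / r, rank-stabilized LoRA uses alpha / sqrt(r)
standard_scale = ALPHA / RANK
rslora_scale = ALPHA / math.sqrt(RANK)
print(standard_scale, rslora_scale)
```

With `r=16` and `lora_alpha=16`, setting `use_rslora=True` makes the adapter update four times larger than the standard scaling would, which is worth keeping in mind when choosing the learning rate.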
286
+ {
287
+ "cell_type": "markdown",
288
+ "id": "3cf58dd8",
289
+ "metadata": {},
290
+ "source": [
291
+ "## Preprocessing and Tokenization"
292
+ ]
293
+ },
294
+ {
295
+ "cell_type": "code",
296
+ "execution_count": 40,
297
+ "id": "c7992ba4",
298
+ "metadata": {},
299
+ "outputs": [
300
+ {
301
+ "name": "stderr",
302
+ "output_type": "stream",
303
+ "text": [
304
+ "Map: 100%|██████████| 18943/18943 [00:01<00:00, 10132.04 examples/s]\n",
305
+ "Map: 100%|██████████| 997/997 [00:00<00:00, 11498.07 examples/s]"
306
+ ]
307
+ },
308
+ {
309
+ "name": "stdout",
310
+ "output_type": "stream",
311
+ "text": [
312
+ "Tokenization complete!\n",
313
+ "Tensor columns: ['input_ids', 'attention_mask', 'token_type_ids', 'labels']\n"
314
+ ]
315
+ },
316
+ {
317
+ "name": "stderr",
318
+ "output_type": "stream",
319
+ "text": [
320
+ "\n"
321
+ ]
322
+ }
323
+ ],
324
+ "source": [
+ "def _preprocess_function(\n",
+ "    batch: dict,\n",
+ "    tokenizer: PreTrainedTokenizer,\n",
+ "    max_length: int = 512,\n",
+ ") -> BatchEncoding:\n",
+ "    \"\"\"Tokenize a batch of texts, truncating to max_length, and attach the labels.\"\"\"\n",
+ "    # No padding here: DataCollatorWithPadding pads each batch at collation time\n",
+ "    model_inputs = tokenizer(\n",
+ "        batch[\"text\"],\n",
+ "        max_length=max_length,\n",
+ "        truncation=True,\n",
+ "    )\n",
+ "    model_inputs[\"labels\"] = batch[\"label\"]\n",
+ "    return model_inputs\n",
+ "\n",
+ "\n",
+ "preprocess_function = partial(_preprocess_function, tokenizer=tokenizer)\n",
+ "tokenized_datasets = raw_datasets.map(\n",
+ "    preprocess_function,\n",
+ "    batched=True,\n",
+ "    remove_columns=[\"text\", \"label\"],\n",
+ ")\n",
+ "\n",
+ "# Ensure PyTorch tensors and expected columns\n",
+ "available_columns = tokenized_datasets[\"train\"].column_names\n",
+ "tensor_columns = [\n",
+ "    column_name\n",
+ "    for column_name in [\"input_ids\", \"attention_mask\", \"token_type_ids\", \"labels\"]\n",
+ "    if column_name in available_columns\n",
+ "]\n",
+ "tokenized_datasets.set_format(type=\"torch\", columns=tensor_columns)\n",
+ "\n",
+ "print(\"Tokenization complete!\")\n",
+ "print(\"Tensor columns:\", tensor_columns)"
357
+ ]
358
+ },
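The tokenizer call above truncates but does not pad, so sequences keep their own lengths until they are batched. A rough pure-Python picture of per-batch (dynamic) padding follows. `pad_batch` is a hypothetical helper and the token ids are made up; the real collator also handles `token_type_ids` and tensor conversion.

```python
def pad_batch(sequences: list[list[int]], pad_id: int = 0) -> dict[str, list[list[int]]]:
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token id sequences of different lengths
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the longest sequence in each batch, rather than to a fixed 512 tokens, avoids spending compute on padding when most texts are short.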
359
+ {
360
+ "cell_type": "markdown",
361
+ "id": "31db700b",
362
+ "metadata": {},
363
+ "source": [
364
+ "## Define Metrics"
365
+ ]
366
+ },
367
+ {
368
+ "cell_type": "code",
369
+ "execution_count": 41,
370
+ "id": "899e4408",
371
+ "metadata": {},
372
+ "outputs": [],
373
+ "source": [
+ "metric_accuracy = evaluate.load(\"accuracy\")\n",
+ "metric_f1 = evaluate.load(\"f1\")\n",
+ "\n",
+ "\n",
+ "def _compute_metrics(\n",
+ "    eval_pred: tuple[np.ndarray, np.ndarray],\n",
+ "    metric_accuracy: evaluate.EvaluationModule,\n",
+ "    metric_f1: evaluate.EvaluationModule,\n",
+ ") -> dict[str, float]:\n",
+ "    \"\"\"Compute accuracy and F1 from the (logits, labels) pair passed by Trainer.\"\"\"\n",
+ "    predictions, labels = eval_pred\n",
+ "\n",
+ "    # Some models return a tuple of outputs; the logits come first\n",
+ "    if isinstance(predictions, tuple):\n",
+ "        predictions = predictions[0]\n",
+ "\n",
+ "    # Convert logits to predicted class ids\n",
+ "    predictions = np.argmax(predictions, axis=1)\n",
+ "\n",
+ "    accuracy = metric_accuracy.compute(predictions=predictions, references=labels)\n",
+ "    f1 = metric_f1.compute(predictions=predictions, references=labels)\n",
+ "\n",
+ "    assert accuracy is not None and f1 is not None\n",
+ "\n",
+ "    result = {\n",
+ "        \"accuracy\": accuracy[\"accuracy\"],\n",
+ "        \"f1\": f1[\"f1\"],\n",
+ "    }\n",
+ "\n",
+ "    return result\n",
+ "\n",
+ "\n",
+ "compute_metrics = partial(\n",
+ "    _compute_metrics, metric_accuracy=metric_accuracy, metric_f1=metric_f1\n",
+ ")"
406
+ ]
407
+ },
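To make the two metrics concrete, here is a dependency-free sketch of what the cell above computes: argmax over the logits, then accuracy and binary F1 with class 1 (AI) as the positive class. The function names and the toy logits are hypothetical; the notebook itself relies on the `evaluate` library.

```python
def argmax_rows(logits: list[list[float]]) -> list[int]:
    """Index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_and_f1(logits: list[list[float]], labels: list[int], positive: int = 1) -> dict[str, float]:
    preds = argmax_rows(logits)
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy example: four samples, one AI text misclassified as human
toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
toy_labels = [0, 1, 1, 1]
metrics = accuracy_and_f1(toy_logits, toy_labels)
print(metrics)
```

Accuracy and F1 diverge when errors are concentrated in one class, which is why the notebook reports both.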
408
+ {
409
+ "cell_type": "markdown",
410
+ "id": "34890c4d",
411
+ "metadata": {},
412
+ "source": [
413
+ "## Training Configuration"
414
+ ]
415
+ },
416
+ {
417
+ "cell_type": "code",
418
+ "execution_count": 42,
419
+ "id": "9717d666",
420
+ "metadata": {},
421
+ "outputs": [],
422
+ "source": [
423
+ "train_batch_size = 4\n",
424
+ "gradient_accumulation_steps = 8\n",
425
+ "eval_batch_size = 4\n",
426
+ "\n",
427
+ "training_args = TrainingArguments(\n",
428
+ " \"./models/bert-base-raid-classifier\",\n",
429
+ " num_train_epochs=5,\n",
430
+ " learning_rate=5e-5,\n",
431
+ " weight_decay=0.1,\n",
432
+ " per_device_train_batch_size=train_batch_size,\n",
433
+ " per_device_eval_batch_size=eval_batch_size,\n",
434
+ " gradient_accumulation_steps=gradient_accumulation_steps,\n",
435
+ " fp16=torch.cuda.is_available(),\n",
436
+ " save_strategy=\"steps\",\n",
437
+ " save_total_limit=2,\n",
438
+ " save_steps=64,\n",
439
+ " metric_for_best_model=\"eval_accuracy\",\n",
440
+ " load_best_model_at_end=True,\n",
441
+ " eval_strategy=\"steps\",\n",
442
+ " eval_steps=64,\n",
443
+ " logging_strategy=\"steps\",\n",
444
+ " logging_steps=16,\n",
445
+ " remove_unused_columns=False,\n",
446
+ ")\n",
447
+ "\n",
448
+ "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)"
449
+ ]
450
+ },
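The batch settings above imply an effective batch size and a total step count that can be worked out directly. The sketch assumes a single device and uses the 18,943 training and 997 test samples reported earlier in the notebook. `optimizer_steps` is a hypothetical helper, and the Trainer's own rounding may differ in edge cases where the divisions are not exact.

```python
import math


def optimizer_steps(num_samples: int, batch_size: int, accumulation: int, epochs: int) -> int:
    """Optimizer updates for a run, assuming one device."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / accumulation)
    return steps_per_epoch * epochs


effective_batch = 4 * 8  # per-device batch size x gradient accumulation steps
total_steps = optimizer_steps(18943, batch_size=4, accumulation=8, epochs=5)
evaluations = total_steps // 64  # one evaluation every eval_steps=64
eval_batches = math.ceil(997 / 4)

print(effective_batch, total_steps, evaluations, eval_batches)
```

These figures line up with the progress bars recorded in the notebook outputs (2960 training steps and 250 evaluation batches).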
451
+ {
452
+ "cell_type": "markdown",
453
+ "id": "e840a954",
454
+ "metadata": {},
455
+ "source": [
456
+ "## Initialize Trainer and Train"
457
+ ]
458
+ },
459
+ {
460
+ "cell_type": "code",
461
+ "execution_count": 43,
462
+ "id": "0fa3ed58",
463
+ "metadata": {},
464
+ "outputs": [
465
+ {
466
+ "name": "stdout",
467
+ "output_type": "stream",
468
+ "text": [
469
+ "Starting training...\n"
470
+ ]
471
+ },
472
+ {
473
+ "data": {
474
+ "text/html": [
475
+ "\n",
476
+ " <div>\n",
477
+ " \n",
478
+ " <progress value='2960' max='2960' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
479
+ " [2960/2960 1:03:52, Epoch 5/5]\n",
480
+ " </div>\n",
481
+ " <table border=\"1\" class=\"dataframe\">\n",
482
+ " <thead>\n",
483
+ " <tr style=\"text-align: left;\">\n",
484
+ " <th>Step</th>\n",
485
+ " <th>Training Loss</th>\n",
486
+ " <th>Validation Loss</th>\n",
487
+ " <th>Accuracy</th>\n",
488
+ " <th>F1</th>\n",
489
+ " </tr>\n",
490
+ " </thead>\n",
491
+ " <tbody>\n",
492
+ " <tr>\n",
493
+ " <td>64</td>\n",
494
+ " <td>5.212345</td>\n",
495
+ " <td>0.625602</td>\n",
496
+ " <td>0.661986</td>\n",
497
+ " <td>0.634093</td>\n",
498
+ " </tr>\n",
499
+ " <tr>\n",
500
+ " <td>128</td>\n",
501
+ " <td>3.753965</td>\n",
502
+ " <td>0.458432</td>\n",
503
+ " <td>0.771314</td>\n",
504
+ " <td>0.809045</td>\n",
505
+ " </tr>\n",
506
+ " <tr>\n",
507
+ " <td>192</td>\n",
508
+ " <td>3.100017</td>\n",
509
+ " <td>0.287685</td>\n",
510
+ " <td>0.889669</td>\n",
511
+ " <td>0.891089</td>\n",
512
+ " </tr>\n",
513
+ " <tr>\n",
514
+ " <td>256</td>\n",
515
+ " <td>2.328572</td>\n",
516
+ " <td>0.390553</td>\n",
517
+ " <td>0.830491</td>\n",
518
+ " <td>0.855432</td>\n",
519
+ " </tr>\n",
520
+ " <tr>\n",
521
+ " <td>320</td>\n",
522
+ " <td>2.129814</td>\n",
523
+ " <td>0.238838</td>\n",
524
+ " <td>0.911735</td>\n",
525
+ " <td>0.917757</td>\n",
526
+ " </tr>\n",
527
+ " <tr>\n",
528
+ " <td>384</td>\n",
529
+ " <td>1.657923</td>\n",
530
+ " <td>0.388610</td>\n",
531
+ " <td>0.856570</td>\n",
532
+ " <td>0.874671</td>\n",
533
+ " </tr>\n",
534
+ " <tr>\n",
535
+ " <td>448</td>\n",
536
+ " <td>1.758504</td>\n",
537
+ " <td>0.179176</td>\n",
538
+ " <td>0.933801</td>\n",
539
+ " <td>0.937262</td>\n",
540
+ " </tr>\n",
541
+ " <tr>\n",
542
+ " <td>512</td>\n",
543
+ " <td>1.352967</td>\n",
544
+ " <td>0.344061</td>\n",
545
+ " <td>0.867603</td>\n",
546
+ " <td>0.882979</td>\n",
547
+ " </tr>\n",
548
+ " <tr>\n",
549
+ " <td>576</td>\n",
550
+ " <td>1.528169</td>\n",
551
+ " <td>0.143238</td>\n",
552
+ " <td>0.945838</td>\n",
553
+ " <td>0.947368</td>\n",
554
+ " </tr>\n",
555
+ " <tr>\n",
556
+ " <td>640</td>\n",
557
+ " <td>1.692302</td>\n",
558
+ " <td>0.185934</td>\n",
559
+ " <td>0.925777</td>\n",
560
+ " <td>0.930582</td>\n",
561
+ " </tr>\n",
562
+ " <tr>\n",
563
+ " <td>704</td>\n",
564
+ " <td>1.194244</td>\n",
565
+ " <td>0.189194</td>\n",
566
+ " <td>0.927783</td>\n",
567
+ " <td>0.932203</td>\n",
568
+ " </tr>\n",
569
+ " <tr>\n",
570
+ " <td>768</td>\n",
571
+ " <td>1.089103</td>\n",
572
+ " <td>0.191697</td>\n",
573
+ " <td>0.926780</td>\n",
574
+ " <td>0.931455</td>\n",
575
+ " </tr>\n",
576
+ " <tr>\n",
577
+ " <td>832</td>\n",
578
+ " <td>1.313780</td>\n",
579
+ " <td>0.133464</td>\n",
580
+ " <td>0.949850</td>\n",
581
+ " <td>0.950593</td>\n",
582
+ " </tr>\n",
583
+ " <tr>\n",
584
+ " <td>896</td>\n",
585
+ " <td>1.144064</td>\n",
586
+ " <td>0.161593</td>\n",
587
+ " <td>0.943831</td>\n",
588
+ " <td>0.946463</td>\n",
589
+ " </tr>\n",
590
+ " <tr>\n",
591
+ " <td>960</td>\n",
592
+ " <td>1.503407</td>\n",
593
+ " <td>0.211920</td>\n",
594
+ " <td>0.921765</td>\n",
595
+ " <td>0.927374</td>\n",
596
+ " </tr>\n",
597
+ " <tr>\n",
598
+ " <td>1024</td>\n",
599
+ " <td>1.106765</td>\n",
600
+ " <td>0.182482</td>\n",
601
+ " <td>0.931795</td>\n",
602
+ " <td>0.935606</td>\n",
603
+ " </tr>\n",
604
+ " <tr>\n",
605
+ " <td>1088</td>\n",
606
+ " <td>1.450451</td>\n",
607
+ " <td>0.127360</td>\n",
608
+ " <td>0.956871</td>\n",
609
+ " <td>0.958212</td>\n",
610
+ " </tr>\n",
611
+ " <tr>\n",
612
+ " <td>1152</td>\n",
613
+ " <td>1.380015</td>\n",
614
+ " <td>0.131538</td>\n",
615
+ " <td>0.957874</td>\n",
616
+ " <td>0.959064</td>\n",
617
+ " </tr>\n",
618
+ " <tr>\n",
619
+ " <td>1216</td>\n",
620
+ " <td>0.755666</td>\n",
621
+ " <td>0.158870</td>\n",
622
+ " <td>0.940822</td>\n",
623
+ " <td>0.943432</td>\n",
624
+ " </tr>\n",
625
+ " <tr>\n",
626
+ " <td>1280</td>\n",
627
+ " <td>0.863713</td>\n",
628
+ " <td>0.157785</td>\n",
629
+ " <td>0.943831</td>\n",
630
+ " <td>0.946565</td>\n",
631
+ " </tr>\n",
632
+ " <tr>\n",
633
+ " <td>1344</td>\n",
634
+ " <td>0.821364</td>\n",
635
+ " <td>0.172321</td>\n",
636
+ " <td>0.944835</td>\n",
637
+ " <td>0.947469</td>\n",
638
+ " </tr>\n",
639
+ " <tr>\n",
640
+ " <td>1408</td>\n",
641
+ " <td>0.957095</td>\n",
642
+ " <td>0.226298</td>\n",
643
+ " <td>0.922768</td>\n",
644
+ " <td>0.927835</td>\n",
645
+ " </tr>\n",
646
+ " <tr>\n",
647
+ " <td>1472</td>\n",
648
+ " <td>0.868089</td>\n",
649
+ " <td>0.197520</td>\n",
650
+ " <td>0.934804</td>\n",
651
+ " <td>0.938505</td>\n",
652
+ " </tr>\n",
653
+ " <tr>\n",
654
+ " <td>1536</td>\n",
655
+ " <td>1.310811</td>\n",
656
+ " <td>0.140865</td>\n",
657
+ " <td>0.953862</td>\n",
658
+ " <td>0.955426</td>\n",
659
+ " </tr>\n",
660
+ " <tr>\n",
661
+ " <td>1600</td>\n",
662
+ " <td>0.708888</td>\n",
663
+ " <td>0.152195</td>\n",
664
+ " <td>0.943831</td>\n",
665
+ " <td>0.946565</td>\n",
666
+ " </tr>\n",
667
+ " <tr>\n",
668
+ " <td>1664</td>\n",
669
+ " <td>0.717255</td>\n",
670
+ " <td>0.176768</td>\n",
671
+ " <td>0.942828</td>\n",
672
+ " <td>0.945663</td>\n",
673
+ " </tr>\n",
674
+ " <tr>\n",
675
+ " <td>1728</td>\n",
676
+ " <td>1.143681</td>\n",
677
+ " <td>0.156816</td>\n",
678
+ " <td>0.951856</td>\n",
679
+ " <td>0.953757</td>\n",
680
+ " </tr>\n",
681
+ " <tr>\n",
682
+ " <td>1792</td>\n",
683
+ " <td>0.638254</td>\n",
684
+ " <td>0.176596</td>\n",
685
+ " <td>0.944835</td>\n",
686
+ " <td>0.947469</td>\n",
687
+ " </tr>\n",
688
+ " <tr>\n",
689
+ " <td>1856</td>\n",
690
+ " <td>1.133300</td>\n",
691
+ " <td>0.119119</td>\n",
692
+ " <td>0.967904</td>\n",
693
+ " <td>0.968317</td>\n",
694
+ " </tr>\n",
695
+ " <tr>\n",
696
+ " <td>1920</td>\n",
697
+ " <td>1.061837</td>\n",
698
+ " <td>0.140624</td>\n",
699
+ " <td>0.957874</td>\n",
700
+ " <td>0.959381</td>\n",
701
+ " </tr>\n",
702
+ " <tr>\n",
703
+ " <td>1984</td>\n",
704
+ " <td>0.708067</td>\n",
705
+ " <td>0.189490</td>\n",
706
+ " <td>0.940822</td>\n",
707
+ " <td>0.943863</td>\n",
708
+ " </tr>\n",
709
+ " <tr>\n",
710
+ " <td>2048</td>\n",
711
+ " <td>0.761451</td>\n",
712
+ " <td>0.150488</td>\n",
713
+ " <td>0.951856</td>\n",
714
+ " <td>0.953846</td>\n",
715
+ " </tr>\n",
716
+ " <tr>\n",
717
+ " <td>2112</td>\n",
718
+ " <td>0.609547</td>\n",
719
+ " <td>0.189622</td>\n",
720
+ " <td>0.940822</td>\n",
721
+ " <td>0.943863</td>\n",
722
+ " </tr>\n",
723
+ " <tr>\n",
724
+ " <td>2176</td>\n",
725
+ " <td>0.803254</td>\n",
726
+ " <td>0.173354</td>\n",
727
+ " <td>0.946841</td>\n",
728
+ " <td>0.949282</td>\n",
729
+ " </tr>\n",
730
+ " <tr>\n",
731
+ " <td>2240</td>\n",
732
+ " <td>0.664540</td>\n",
733
+ " <td>0.154308</td>\n",
734
+ " <td>0.952859</td>\n",
735
+ " <td>0.954764</td>\n",
736
+ " </tr>\n",
737
+ " <tr>\n",
738
+ " <td>2304</td>\n",
739
+ " <td>0.691763</td>\n",
740
+ " <td>0.144127</td>\n",
741
+ " <td>0.963892</td>\n",
742
+ " <td>0.964706</td>\n",
743
+ " </tr>\n",
744
+ " <tr>\n",
745
+ " <td>2368</td>\n",
746
+ " <td>1.092195</td>\n",
747
+ " <td>0.157182</td>\n",
748
+ " <td>0.957874</td>\n",
749
+ " <td>0.959381</td>\n",
750
+ " </tr>\n",
751
+ " <tr>\n",
752
+ " <td>2432</td>\n",
753
+ " <td>0.752286</td>\n",
754
+ " <td>0.231035</td>\n",
755
+ " <td>0.933801</td>\n",
756
+ " <td>0.937736</td>\n",
757
+ " </tr>\n",
758
+ " <tr>\n",
759
+ " <td>2496</td>\n",
760
+ " <td>0.757014</td>\n",
761
+ " <td>0.185019</td>\n",
762
+ " <td>0.948847</td>\n",
763
+ " <td>0.951103</td>\n",
764
+ " </tr>\n",
765
+ " <tr>\n",
766
+ " <td>2560</td>\n",
767
+ " <td>0.766771</td>\n",
768
+ " <td>0.153019</td>\n",
769
+ " <td>0.958877</td>\n",
770
+ " <td>0.960078</td>\n",
771
+ " </tr>\n",
772
+ " <tr>\n",
773
+ " <td>2624</td>\n",
774
+ " <td>0.434590</td>\n",
775
+ " <td>0.201383</td>\n",
776
+ " <td>0.946841</td>\n",
777
+ " <td>0.949282</td>\n",
778
+ " </tr>\n",
779
+ " <tr>\n",
780
+ " <td>2688</td>\n",
781
+ " <td>0.565482</td>\n",
782
+ " <td>0.181478</td>\n",
783
+ " <td>0.952859</td>\n",
784
+ " <td>0.954764</td>\n",
785
+ " </tr>\n",
786
+ " <tr>\n",
787
+ " <td>2752</td>\n",
788
+ " <td>0.568177</td>\n",
789
+ " <td>0.201250</td>\n",
790
+ " <td>0.946841</td>\n",
791
+ " <td>0.949282</td>\n",
792
+ " </tr>\n",
793
+ " <tr>\n",
794
+ " <td>2816</td>\n",
795
+ " <td>0.611295</td>\n",
796
+ " <td>0.173839</td>\n",
797
+ " <td>0.954865</td>\n",
798
+ " <td>0.956606</td>\n",
799
+ " </tr>\n",
800
+ " <tr>\n",
801
+ " <td>2880</td>\n",
802
+ " <td>0.716351</td>\n",
803
+ " <td>0.187448</td>\n",
804
+ " <td>0.948847</td>\n",
805
+ " <td>0.951103</td>\n",
806
+ " </tr>\n",
807
+ " <tr>\n",
808
+ " <td>2944</td>\n",
809
+ " <td>0.603852</td>\n",
810
+ " <td>0.184578</td>\n",
811
+ " <td>0.948847</td>\n",
812
+ " <td>0.951103</td>\n",
813
+ " </tr>\n",
814
+ " </tbody>\n",
815
+ "</table><p>"
816
+ ],
817
+ "text/plain": [
818
+ "<IPython.core.display.HTML object>"
819
+ ]
820
+ },
821
+ "metadata": {},
822
+ "output_type": "display_data"
823
+ },
824
+ {
825
+ "data": {
826
+ "text/plain": [
827
+ "TrainOutput(global_step=2960, training_loss=1.3125710455146995, metrics={'train_runtime': 3832.8474, 'train_samples_per_second': 24.711, 'train_steps_per_second': 0.772, 'total_flos': 8360830141838376.0, 'train_loss': 1.3125710455146995, 'epoch': 5.0})"
828
+ ]
829
+ },
830
+ "execution_count": 43,
831
+ "metadata": {},
832
+ "output_type": "execute_result"
833
+ }
834
+ ],
835
+ "source": [
836
+ "trainer = Trainer(\n",
837
+ " model,\n",
838
+ " training_args,\n",
839
+ " train_dataset=tokenized_datasets[\"train\"],\n",
840
+ " eval_dataset=tokenized_datasets[\"test\"],\n",
841
+ " data_collator=data_collator,\n",
842
+ " compute_metrics=compute_metrics,\n",
843
+ ")\n",
844
+ "\n",
845
+ "print(\"Starting training...\")\n",
846
+ "trainer.train()"
847
+ ]
848
+ },
849
+ {
850
+ "cell_type": "markdown",
851
+ "id": "cde9bbb1",
852
+ "metadata": {},
853
+ "source": [
854
+ "## Final Evaluation"
855
+ ]
856
+ },
857
+ {
858
+ "cell_type": "code",
859
+ "execution_count": 44,
860
+ "id": "bb81afb9",
861
+ "metadata": {},
862
+ "outputs": [
863
+ {
864
+ "name": "stdout",
865
+ "output_type": "stream",
866
+ "text": [
867
+ "Evaluating model...\n"
868
+ ]
869
+ },
870
+ {
871
+ "data": {
872
+ "text/html": [
873
+ "\n",
874
+ " <div>\n",
875
+ " \n",
876
+ " <progress value='250' max='250' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
877
+ " [250/250 00:14]\n",
878
+ " </div>\n",
879
+ " "
880
+ ],
881
+ "text/plain": [
882
+ "<IPython.core.display.HTML object>"
883
+ ]
884
+ },
885
+ "metadata": {},
886
+ "output_type": "display_data"
887
+ },
888
+ {
889
+ "name": "stdout",
890
+ "output_type": "stream",
891
+ "text": [
892
+ "\n",
893
+ "Final Results:\n",
894
+ "Accuracy: 0.9679\n",
895
+ "F1 Score: 0.9683\n"
896
+ ]
897
+ }
898
+ ],
899
+ "source": [
900
+ "print(\"Evaluating model...\")\n",
901
+ "eval_results = trainer.evaluate()\n",
902
+ "print(\"\\nFinal Results:\")\n",
903
+ "print(f\"Accuracy: {eval_results['eval_accuracy']:.4f}\")\n",
904
+ "print(f\"F1 Score: {eval_results['eval_f1']:.4f}\")"
905
+ ]
906
+ },
907
+ {
908
+ "cell_type": "markdown",
909
+ "id": "8bf17a40",
910
+ "metadata": {},
911
+ "source": [
912
+ "## Save Model"
913
+ ]
914
+ },
915
+ {
916
+ "cell_type": "code",
917
+ "execution_count": 45,
918
+ "id": "e580bfd6",
919
+ "metadata": {},
920
+ "outputs": [
921
+ {
922
+ "name": "stdout",
923
+ "output_type": "stream",
924
+ "text": [
925
+ "Model saved successfully!\n"
926
+ ]
927
+ }
928
+ ],
929
+ "source": [
930
+ "# Save the final model\n",
931
+ "trainer.save_model(\"./trained_model/bert-base-raid-final\")\n",
932
+ "print(\"Model saved successfully!\")"
933
+ ]
934
+ },
935
+ {
936
+ "cell_type": "markdown",
937
+ "id": "99c0a2f0",
938
+ "metadata": {},
939
+ "source": [
940
+ "## Test the Model\n"
941
+ ]
942
+ },
943
+ {
944
+ "cell_type": "code",
945
+ "execution_count": 46,
946
+ "id": "016cc53e",
947
+ "metadata": {},
948
+ "outputs": [
949
+ {
950
+ "name": "stdout",
951
+ "output_type": "stream",
952
+ "text": [
953
+ "Prediction for human-written text:\n",
954
+ "{'predicted_label': 0, 'probability_human': 0.9988395571708679, 'probability_ai': 0.0011604195460677147}\n",
955
+ "\n",
956
+ "Prediction for AI-generated text:\n",
957
+ "{'predicted_label': 0, 'probability_human': 0.9988927245140076, 'probability_ai': 0.0011073390487581491}\n"
958
+ ]
959
+ }
960
+ ],
961
+ "source": [
+ "def predict(text: str) -> dict[str, int | float]:\n",
+ "    \"\"\"Classify one text and return the predicted label with both class probabilities.\"\"\"\n",
+ "    inputs = tokenizer(\n",
+ "        text,\n",
+ "        max_length=512,\n",
+ "        truncation=True,\n",
+ "        return_tensors=\"pt\",\n",
+ "    ).to(model.device)\n",
+ "\n",
+ "    model.eval()  # disable dropout (including LoRA dropout) for inference\n",
+ "    with torch.no_grad():\n",
+ "        outputs = model(**inputs)\n",
+ "\n",
+ "    logits = outputs.logits\n",
+ "    probabilities = torch.softmax(logits, dim=-1).cpu().numpy()[0]\n",
+ "    predicted_label = np.argmax(probabilities)\n",
+ "\n",
+ "    return {\n",
+ "        \"predicted_label\": int(predicted_label),\n",
+ "        \"probability_human\": float(probabilities[0]),\n",
+ "        \"probability_ai\": float(probabilities[1]),\n",
+ "    }\n",
+ "\n",
+ "\n",
+ "text = \"Ai will replace this world. today in the nepal election someone might win by using ai.\"\n",
+ "text_by_ai = \"This is a sample text generated by AI.Also This is an long text by AI.\"\n",
+ "print(\"Prediction for human-written text:\")\n",
+ "print(predict(text))\n",
+ "print(\"\\nPrediction for AI-generated text:\")\n",
+ "print(predict(text_by_ai))\n"
988
+ ]
989
+ },
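The probabilities returned by `predict` come from a softmax over the two logits. A minimal, numerically stable version in plain Python is sketched below; the logit values are hypothetical and chosen only to show how a gap of a few units between logits already yields a probability very close to 1.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for [human, ai]
probs = softmax([3.4, -3.4])
predicted_label = max(range(len(probs)), key=probs.__getitem__)
print(predicted_label, probs)
```

A confident score is therefore not evidence of a correct prediction. In the run recorded above, both sample sentences, including the one labelled as AI-generated, received a human probability above 0.99.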
990
+ {
991
+ "cell_type": "markdown",
992
+ "id": "7c6c2a5d",
993
+ "metadata": {},
994
+ "source": [
995
+ "## RAID Quick Test"
996
+ ]
997
+ },
998
+ {
999
+ "cell_type": "code",
1000
+ "execution_count": 47,
1001
+ "id": "1b287605",
1002
+ "metadata": {},
1003
+ "outputs": [
1004
+ {
1005
+ "name": "stdout",
1006
+ "output_type": "stream",
1007
+ "text": [
1008
+ "Using 512 samples for RAID quick test\n"
1009
+ ]
1010
+ },
1011
+ {
1012
+ "ename": "OutOfMemoryError",
1013
+ "evalue": "CUDA out of memory. Tried to allocate 768.00 MiB. GPU 0 has a total capacity of 3.68 GiB of which 719.12 MiB is free. Process 2034 has 46.03 MiB memory in use. Process 1961 has 6.78 MiB memory in use. Including non-PyTorch memory, this process has 2.90 GiB memory in use. Of the allocated memory 2.71 GiB is allocated by PyTorch, and 85.13 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)",
1014
+ "output_type": "error",
1015
+ "traceback": [
1016
+ "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
1017
+ "\u001b[31mOutOfMemoryError\u001b[39m Traceback (most recent call last)",
1018
+ "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[47]\u001b[39m\u001b[32m, line 32\u001b[39m\n\u001b[32m 28\u001b[39m \u001b[38;5;66;03m# Return AI-class probability for each input text\u001b[39;00m\n\u001b[32m 29\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m probabilities[:, \u001b[32m1\u001b[39m].astype(\u001b[38;5;28mfloat\u001b[39m).tolist()\n\u001b[32m---> \u001b[39m\u001b[32m32\u001b[39m predictions = \u001b[43mrun_detection\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmy_detector\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtest_df\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 33\u001b[39m evaluation_result = run_evaluation(predictions, test_df)\n\u001b[32m 35\u001b[39m evaluation_result\n",
1019
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/raid/detect.py:6\u001b[39m, in \u001b[36mrun_detection\u001b[39m\u001b[34m(f, df)\u001b[39m\n\u001b[32m 3\u001b[39m scores_df = df[[\u001b[33m\"\u001b[39m\u001b[33mid\u001b[39m\u001b[33m\"\u001b[39m]].copy()\n\u001b[32m 5\u001b[39m \u001b[38;5;66;03m# Run the detector function on the dataset and put output in score column\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m6\u001b[39m scores_df[\u001b[33m\"\u001b[39m\u001b[33mscore\u001b[39m\u001b[33m\"\u001b[39m] = \u001b[43mf\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdf\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mgeneration\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m.\u001b[49m\u001b[43mtolist\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 8\u001b[39m \u001b[38;5;66;03m# Convert scores and ids to dict in 'records' format for seralization\u001b[39;00m\n\u001b[32m 9\u001b[39m \u001b[38;5;66;03m# e.g. [{'id':'...', 'score':0}, {'id':'...', 'score':1}, ...]\u001b[39;00m\n\u001b[32m 10\u001b[39m results = scores_df[[\u001b[33m\"\u001b[39m\u001b[33mid\u001b[39m\u001b[33m\"\u001b[39m, \u001b[33m\"\u001b[39m\u001b[33mscore\u001b[39m\u001b[33m\"\u001b[39m]].to_dict(orient=\u001b[33m\"\u001b[39m\u001b[33mrecords\u001b[39m\u001b[33m\"\u001b[39m)\n",
1020
+ "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[47]\u001b[39m\u001b[32m, line 24\u001b[39m, in \u001b[36mmy_detector\u001b[39m\u001b[34m(texts)\u001b[39m\n\u001b[32m 22\u001b[39m model.eval()\n\u001b[32m 23\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m torch.no_grad():\n\u001b[32m---> \u001b[39m\u001b[32m24\u001b[39m outputs = \u001b[43mmodel\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43minputs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 25\u001b[39m logits = outputs.logits\n\u001b[32m 26\u001b[39m probabilities = torch.softmax(logits, dim=-\u001b[32m1\u001b[39m).cpu().numpy()\n",
1021
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1022
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
1023
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/accelerate/utils/operations.py:819\u001b[39m, in \u001b[36mconvert_outputs_to_fp32.<locals>.forward\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 818\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mforward\u001b[39m(*args, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m819\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mmodel_forward\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1024
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/accelerate/utils/operations.py:807\u001b[39m, in \u001b[36mConvertOutputsToFp32.__call__\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 806\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m__call__\u001b[39m(\u001b[38;5;28mself\u001b[39m, *args, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m807\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m convert_to_fp32(\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmodel_forward\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m)\n",
1025
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/amp/autocast_mode.py:44\u001b[39m, in \u001b[36mautocast_decorator.<locals>.decorate_autocast\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 41\u001b[39m \u001b[38;5;129m@functools\u001b[39m.wraps(func)\n\u001b[32m 42\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mdecorate_autocast\u001b[39m(*args, **kwargs):\n\u001b[32m 43\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m autocast_instance:\n\u001b[32m---> \u001b[39m\u001b[32m44\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1026
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/peft/peft_model.py:921\u001b[39m, in \u001b[36mPeftModel.forward\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 919\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mself\u001b[39m._enable_peft_forward_hooks(*args, **kwargs):\n\u001b[32m 920\u001b[39m kwargs = {k: v \u001b[38;5;28;01mfor\u001b[39;00m k, v \u001b[38;5;129;01min\u001b[39;00m kwargs.items() \u001b[38;5;28;01mif\u001b[39;00m k \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.special_peft_forward_args}\n\u001b[32m--> \u001b[39m\u001b[32m921\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mget_base_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1027
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1028
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
1029
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/generic.py:835\u001b[39m, in \u001b[36mcan_return_tuple.<locals>.wrapper\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 833\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m return_dict_passed \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 834\u001b[39m return_dict = return_dict_passed\n\u001b[32m--> \u001b[39m\u001b[32m835\u001b[39m output = \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 836\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m return_dict \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(output, \u001b[38;5;28mtuple\u001b[39m):\n\u001b[32m 837\u001b[39m output = output.to_tuple()\n",
1030
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py:1162\u001b[39m, in \u001b[36mBertForSequenceClassification.forward\u001b[39m\u001b[34m(self, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds, labels, **kwargs)\u001b[39m\n\u001b[32m 1144\u001b[39m \u001b[38;5;129m@can_return_tuple\u001b[39m\n\u001b[32m 1145\u001b[39m \u001b[38;5;129m@auto_docstring\u001b[39m\n\u001b[32m 1146\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mforward\u001b[39m(\n\u001b[32m (...)\u001b[39m\u001b[32m 1154\u001b[39m **kwargs: Unpack[TransformersKwargs],\n\u001b[32m 1155\u001b[39m ) -> \u001b[38;5;28mtuple\u001b[39m[torch.Tensor] | SequenceClassifierOutput:\n\u001b[32m 1156\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33mr\u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 1157\u001b[39m \u001b[33;03m labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):\u001b[39;00m\n\u001b[32m 1158\u001b[39m \u001b[33;03m Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,\u001b[39;00m\n\u001b[32m 1159\u001b[39m \u001b[33;03m config.num_labels - 1]`. 
If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If\u001b[39;00m\n\u001b[32m 1160\u001b[39m \u001b[33;03m `config.num_labels > 1` a classification loss is computed (Cross-Entropy).\u001b[39;00m\n\u001b[32m 1161\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1162\u001b[39m outputs = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mbert\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 1163\u001b[39m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1164\u001b[39m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m=\u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1165\u001b[39m \u001b[43m \u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1166\u001b[39m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1167\u001b[39m \u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m=\u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1168\u001b[39m \u001b[43m \u001b[49m\u001b[43mreturn_dict\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 1169\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1170\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1172\u001b[39m pooled_output = outputs[\u001b[32m1\u001b[39m]\n\u001b[32m 1174\u001b[39m pooled_output = \u001b[38;5;28mself\u001b[39m.dropout(pooled_output)\n",
1031
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1032
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
1033
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/generic.py:1002\u001b[39m, in \u001b[36mcheck_model_inputs.<locals>.wrapped_fn.<locals>.wrapper\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1000\u001b[39m outputs = func(\u001b[38;5;28mself\u001b[39m, *args, **kwargs)\n\u001b[32m 1001\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1002\u001b[39m outputs = \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1003\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m original_exception:\n\u001b[32m 1004\u001b[39m \u001b[38;5;66;03m# If we get a TypeError, it's possible that the model is not receiving the recordable kwargs correctly.\u001b[39;00m\n\u001b[32m 1005\u001b[39m \u001b[38;5;66;03m# Get a TypeError even after removing the recordable kwargs -> re-raise the original exception\u001b[39;00m\n\u001b[32m 1006\u001b[39m \u001b[38;5;66;03m# Otherwise -> we're probably missing `**kwargs` in the decorated function\u001b[39;00m\n\u001b[32m 1007\u001b[39m kwargs_without_recordable = {k: v \u001b[38;5;28;01mfor\u001b[39;00m k, v \u001b[38;5;129;01min\u001b[39;00m kwargs.items() \u001b[38;5;28;01mif\u001b[39;00m k \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m recordable_keys}\n",
1034
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py:679\u001b[39m, in \u001b[36mBertModel.forward\u001b[39m\u001b[34m(self, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, cache_position, **kwargs)\u001b[39m\n\u001b[32m 676\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m cache_position \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 677\u001b[39m cache_position = torch.arange(past_key_values_length, past_key_values_length + seq_length, device=device)\n\u001b[32m--> \u001b[39m\u001b[32m679\u001b[39m embedding_output = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43membeddings\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 680\u001b[39m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 681\u001b[39m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 682\u001b[39m \u001b[43m \u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtoken_type_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 683\u001b[39m \u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m=\u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 684\u001b[39m \u001b[43m \u001b[49m\u001b[43mpast_key_values_length\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpast_key_values_length\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 685\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 687\u001b[39m attention_mask, encoder_attention_mask = \u001b[38;5;28mself\u001b[39m._create_attention_masks(\n\u001b[32m 688\u001b[39m attention_mask=attention_mask,\n\u001b[32m 689\u001b[39m encoder_attention_mask=encoder_attention_mask,\n\u001b[32m (...)\u001b[39m\u001b[32m 
693\u001b[39m past_key_values=past_key_values,\n\u001b[32m 694\u001b[39m )\n\u001b[32m 696\u001b[39m encoder_outputs = \u001b[38;5;28mself\u001b[39m.encoder(\n\u001b[32m 697\u001b[39m embedding_output,\n\u001b[32m 698\u001b[39m attention_mask=attention_mask,\n\u001b[32m (...)\u001b[39m\u001b[32m 705\u001b[39m **kwargs,\n\u001b[32m 706\u001b[39m )\n",
1035
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1736\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1734\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1735\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1736\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1036
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py:1747\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1742\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1743\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1744\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1745\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1746\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1747\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1749\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1750\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
1037
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py:107\u001b[39m, in \u001b[36mBertEmbeddings.forward\u001b[39m\u001b[34m(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)\u001b[39m\n\u001b[32m 104\u001b[39m embeddings = inputs_embeds + token_type_embeddings\n\u001b[32m 106\u001b[39m position_embeddings = \u001b[38;5;28mself\u001b[39m.position_embeddings(position_ids)\n\u001b[32m--> \u001b[39m\u001b[32m107\u001b[39m embeddings = \u001b[43membeddings\u001b[49m\u001b[43m \u001b[49m\u001b[43m+\u001b[49m\u001b[43m \u001b[49m\u001b[43mposition_embeddings\u001b[49m\n\u001b[32m 109\u001b[39m embeddings = \u001b[38;5;28mself\u001b[39m.LayerNorm(embeddings)\n\u001b[32m 110\u001b[39m embeddings = \u001b[38;5;28mself\u001b[39m.dropout(embeddings)\n",
1038
+ "\u001b[31mOutOfMemoryError\u001b[39m: CUDA out of memory. Tried to allocate 768.00 MiB. GPU 0 has a total capacity of 3.68 GiB of which 719.12 MiB is free. Process 2034 has 46.03 MiB memory in use. Process 1961 has 6.78 MiB memory in use. Including non-PyTorch memory, this process has 2.90 GiB memory in use. Of the allocated memory 2.71 GiB is allocated by PyTorch, and 85.13 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"
1039
+ ]
1040
+ }
1041
+ ],
1042
+ "source": [
1043
+ "from raid import run_detection, run_evaluation\n",
1044
+ "from raid.utils import load_data\n",
1045
+ "\n",
1046
+ "# Use test split and cap sample size for a quick RAID validation\n",
1047
+ "test_df = load_data(split=\"test\")\n",
1048
+ "sample_size = min(int(len(test_df) * 0.02), 512)\n",
1049
+ "test_df = test_df.sample(n=sample_size, random_state=42)\n",
1050
+ "\n",
1051
+ "print(f\"Using {len(test_df)} samples for RAID quick test\")\n",
1052
+ "\n",
1053
+ "\n",
1054
+ "def my_detector(texts: list[str], batch_size: int = 16) -> list[float]:\n",
+ "    # RAID passes the full list of strings and expects one score per text.\n",
+ "    # Score in small batches: sending every sample through the model at once\n",
+ "    # raised CUDA OutOfMemoryError on a GPU with 3.68 GiB of memory.\n",
+ "    model.eval()\n",
+ "    scores: list[float] = []\n",
+ "    for start in range(0, len(texts), batch_size):\n",
+ "        inputs = tokenizer(\n",
+ "            texts[start : start + batch_size],\n",
+ "            max_length=512,\n",
+ "            truncation=True,\n",
+ "            padding=True,\n",
+ "            return_tensors=\"pt\",\n",
+ "        ).to(model.device)\n",
+ "\n",
+ "        with torch.no_grad():\n",
+ "            logits = model(**inputs).logits\n",
+ "        probabilities = torch.softmax(logits, dim=-1).cpu().numpy()\n",
+ "\n",
+ "        # Keep the AI-class probability (column 1) for each text in the batch\n",
+ "        scores.extend(probabilities[:, 1].astype(float).tolist())\n",
+ "\n",
+ "    return scores\n",
1072
+ "\n",
1073
+ "\n",
1074
+ "predictions = run_detection(my_detector, test_df)\n",
1075
+ "evaluation_result = run_evaluation(predictions, test_df)\n",
1076
+ "\n",
1077
+ "evaluation_result"
1078
+ ]
1079
+ },
1080
+ {
1081
+ "cell_type": "code",
1082
+ "execution_count": null,
1083
+ "id": "6b6eb543",
1084
+ "metadata": {},
1085
+ "outputs": [],
1086
+ "source": []
1087
+ }
1088
+ ],
1089
+ "metadata": {
1090
+ "kernelspec": {
1091
+ "display_name": "ml",
1092
+ "language": "python",
1093
+ "name": "python3"
1094
+ },
1095
+ "language_info": {
1096
+ "codemirror_mode": {
1097
+ "name": "ipython",
1098
+ "version": 3
1099
+ },
1100
+ "file_extension": ".py",
1101
+ "mimetype": "text/x-python",
1102
+ "name": "python",
1103
+ "nbconvert_exporter": "python",
1104
+ "pygments_lexer": "ipython3",
1105
+ "version": "3.11.6"
1106
+ }
1107
+ },
1108
+ "nbformat": 4,
1109
+ "nbformat_minor": 5
1110
+ }
notebook/ai_vs_human/mainv2.ipynb ADDED
@@ -0,0 +1,1170 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "464eefd0",
6
+ "metadata": {},
7
+ "source": [
8
+ "# AI vs Human Detector V2\n",
9
+ "This notebook trains a V2 model that explicitly supports short inputs (including sentences under 50 words) and saves artifacts in `v2_model/`."
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "markdown",
14
+ "id": "0be0e8d9",
15
+ "metadata": {},
16
+ "source": [
17
+ "## ✅ Bug Fixes & Capabilities\n",
18
+ "\n",
19
+ "**Fixed Issues:**\n",
20
+ "1. ✅ Runtime error when calling `trainer.evaluate()` after training (removed duplicate evaluation)\n",
21
+ "2. ✅ Missing `accelerate` dependency (auto-installs if needed)\n",
22
+ "3. ✅ Recursive dataset loading from `./DATASET/` folder (supports `.jsonl`, `.json`, `.csv`)\n",
23
+ "4. ✅ Short sentence support (<50 words) with data augmentation\n",
24
+ "\n",
25
+ "**Model Capabilities:**\n",
26
+ "- ✅ Accepts text of any length: very short (1-10 words), short (11-50), medium (51-150), long (over 150)\n",
27
+ "- ✅ Handles edge cases: single words, special characters, numbers, mixed formats\n",
28
+ "- ✅ Batch prediction support\n",
29
+ "- ✅ Saves to `v2_model/` with tokenizer, config, and label map\n",
30
+ "- ✅ Can be loaded independently after saving\n",
31
+ "\n",
32
+ "**Architecture:** DistilRoBERTa-base (6 layers, about 82M parameters; smaller and faster than BERT-base)\n",
33
+ "\n",
34
+ "**Quick Start:**\n",
35
+ "1. Run cells 1-7 to prepare data\n",
36
+ "2. Run cell 8 to train (takes ~15-30 min on GPU)\n",
37
+ "3. Run cell 9 to save to `v2_model/`\n",
38
+ "4. Run cells 10-12 to test all sentence types"
39
+ ]
40
+ },
41
+ {
42
+ "cell_type": "markdown",
43
+ "id": "3a8134db",
44
+ "metadata": {},
45
+ "source": [
46
+ "## Additional Testing: Extreme Edge Cases & Batch Prediction"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "code",
51
+ "execution_count": 1,
52
+ "id": "f400f763",
53
+ "metadata": {},
54
+ "outputs": [
55
+ {
56
+ "name": "stdout",
57
+ "output_type": "stream",
58
+ "text": [
59
+ "Note: you may need to restart the kernel to use updated packages.\n"
60
+ ]
61
+ }
62
+ ],
63
+ "source": [
64
+ "%pip install -q -U datasets evaluate transformers torch pandas scikit-learn accelerate"
65
+ ]
66
+ },
67
+ {
68
+ "cell_type": "code",
69
+ "execution_count": 2,
70
+ "id": "0c3d4d6d",
71
+ "metadata": {},
72
+ "outputs": [
73
+ {
74
+ "name": "stderr",
75
+ "output_type": "stream",
76
+ "text": [
77
+ "/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
78
+ " from .autonotebook import tqdm as notebook_tqdm\n",
79
+ "/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'Could not load this library: /home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?\n",
80
+ " warn(\n",
81
+ "/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().\n",
82
+ " warnings.warn(_BETA_TRANSFORMS_WARNING)\n",
83
+ "/home/pujan/miniconda3/envs/ml/lib/python3.11/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().\n",
84
+ " warnings.warn(_BETA_TRANSFORMS_WARNING)\n"
85
+ ]
86
+ }
87
+ ],
88
+ "source": [
+ "from __future__ import annotations\n",
+ "\n",
+ "from dataclasses import dataclass\n",
+ "from functools import partial\n",
+ "from pathlib import Path\n",
+ "import json\n",
+ "import random\n",
+ "\n",
+ "import datasets\n",
+ "from datasets import Dataset, DatasetDict, concatenate_datasets\n",
+ "import evaluate\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import torch\n",
+ "from transformers import (\n",
+ "    AutoModelForSequenceClassification,\n",
+ "    AutoTokenizer,\n",
+ "    BatchEncoding,\n",
+ "    DataCollatorWithPadding,\n",
+ "    PreTrainedTokenizer,\n",
+ "    Trainer,\n",
+ "    TrainingArguments,\n",
+ ")\n",
+ "from packaging import version"
113
+ ]
114
+ },
115
+ {
116
+ "cell_type": "code",
117
+ "execution_count": 3,
118
+ "id": "624d23ba",
119
+ "metadata": {},
120
+ "outputs": [
121
+ {
122
+ "name": "stdout",
123
+ "output_type": "stream",
124
+ "text": [
125
+ "Base model: distilroberta-base\n",
126
+ "Device: cuda\n",
127
+ "Output path: ./v2_model\n"
128
+ ]
129
+ }
130
+ ],
131
+ "source": [
+ "@dataclass\n",
+ "class V2Config:\n",
+ "    base_model_name: str = \"distilroberta-base\"\n",
+ "    max_samples: int = 20000\n",
+ "    max_length: int = 256\n",
+ "    short_word_limit: int = 50\n",
+ "    short_aug_ratio: float = 0.35\n",
+ "    output_dir: str = \"./v2_model\"\n",
+ "    seed: int = 42\n",
+ "\n",
+ "\n",
+ "cfg = V2Config()\n",
+ "DEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+ "random.seed(cfg.seed)\n",
+ "np.random.seed(cfg.seed)\n",
+ "torch.manual_seed(cfg.seed)\n",
+ "\n",
+ "print(f\"Base model: {cfg.base_model_name}\")\n",
+ "print(f\"Device: {DEVICE}\")\n",
+ "print(f\"Output path: {cfg.output_dir}\")"
152
+ ]
153
+ },
154
+ {
155
+ "cell_type": "code",
156
+ "execution_count": 4,
157
+ "id": "0a1f860a",
158
+ "metadata": {},
159
+ "outputs": [],
160
+ "source": [
+ "def normalize_text(text: str) -> str:\n",
+ "    \"\"\"Collapse every run of whitespace into a single space.\"\"\"\n",
+ "    return \" \".join(str(text).split()).strip()\n",
+ "\n",
+ "\n",
+ "def count_words(text: str) -> int:\n",
+ "    return len(normalize_text(text).split())\n",
+ "\n",
+ "\n",
+ "def _load_local_file_to_text_labels(file_path: Path) -> tuple[list[str], list[int]]:\n",
+ "    \"\"\"Read one local file and return (texts, labels); 0 = human, 1 = AI.\"\"\"\n",
+ "    texts: list[str] = []\n",
+ "    labels: list[int] = []\n",
+ "\n",
+ "    try:\n",
+ "        suffix = file_path.suffix.lower()\n",
+ "        if suffix == \".jsonl\":\n",
+ "            df = pd.read_json(file_path, lines=True)\n",
+ "        elif suffix == \".json\":\n",
+ "            df = pd.read_json(file_path)\n",
+ "        elif suffix == \".csv\":\n",
+ "            df = pd.read_csv(file_path)\n",
+ "        else:\n",
+ "            return texts, labels\n",
+ "\n",
+ "        if {\"human_text\", \"ai_text\"}.issubset(df.columns):\n",
+ "            human_texts = [normalize_text(x) for x in df[\"human_text\"].dropna().tolist()]\n",
+ "            ai_texts = [normalize_text(x) for x in df[\"ai_text\"].dropna().tolist()]\n",
+ "            human_texts = [x for x in human_texts if x]\n",
+ "            ai_texts = [x for x in ai_texts if x]\n",
+ "            texts.extend(human_texts)\n",
+ "            labels.extend([0] * len(human_texts))\n",
+ "            texts.extend(ai_texts)\n",
+ "            labels.extend([1] * len(ai_texts))\n",
+ "            return texts, labels\n",
+ "\n",
+ "        # Alternative schema fallback: text + label/ai_gen columns.\n",
+ "        if \"text\" in df.columns and (\"label\" in df.columns or \"ai_gen\" in df.columns):\n",
+ "            label_col = \"label\" if \"label\" in df.columns else \"ai_gen\"\n",
+ "            for _, row in df.iterrows():\n",
+ "                text = normalize_text(row.get(\"text\", \"\"))\n",
+ "                if not text:\n",
+ "                    continue\n",
+ "                val = str(row.get(label_col, \"\")).strip().lower()\n",
+ "                # \"1.0\" covers integer labels that pandas parsed as floats.\n",
+ "                is_ai = val in {\"1\", \"1.0\", \"true\", \"ai\", \"ai-generated\", \"ai_generated\"}\n",
+ "                texts.append(text)\n",
+ "                labels.append(1 if is_ai else 0)\n",
+ "            return texts, labels\n",
+ "\n",
+ "    except Exception as error:\n",
+ "        print(f\"Skipped file due to parse error: {file_path} ({error})\")\n",
+ "\n",
+ "    return texts, labels\n",
+ "\n",
+ "\n",
+ "def get_combined_dataset(max_samples: int = 20000, use_local: bool = True) -> DatasetDict:\n",
+ "    \"\"\"Merge HC3 (when available) with local files, then split 90/10.\"\"\"\n",
+ "    all_texts: list[str] = []\n",
+ "    all_labels: list[int] = []\n",
+ "\n",
+ "    try:\n",
+ "        hc3 = datasets.load_dataset(\"Hello-SimpleAI/HC3\", \"all\", split=\"train\")\n",
+ "        for row in hc3:\n",
+ "            for answer in row.get(\"human_answers\", [])[:1]:\n",
+ "                text = normalize_text(answer)\n",
+ "                if text:\n",
+ "                    all_texts.append(text)\n",
+ "                    all_labels.append(0)\n",
+ "            for answer in row.get(\"chatgpt_answers\", [])[:1]:\n",
+ "                text = normalize_text(answer)\n",
+ "                if text:\n",
+ "                    all_texts.append(text)\n",
+ "                    all_labels.append(1)\n",
+ "        print(f\"HC3 samples: {len(all_texts)}\")\n",
+ "    except Exception as error:\n",
+ "        print(f\"HC3 unavailable: {error}\")\n",
+ "\n",
+ "    if use_local:\n",
+ "        dataset_root = Path(\"./DATASET\")\n",
+ "        candidates = list(dataset_root.rglob(\"*.jsonl\")) + list(dataset_root.rglob(\"*.json\")) + list(dataset_root.rglob(\"*.csv\"))\n",
+ "\n",
+ "        local_before = len(all_texts)\n",
+ "        for file_path in candidates:\n",
+ "            texts, labels = _load_local_file_to_text_labels(file_path)\n",
+ "            all_texts.extend(texts)\n",
+ "            all_labels.extend(labels)\n",
+ "\n",
+ "        print(f\"Local recursive files scanned: {len(candidates)}\")\n",
+ "        print(f\"Local samples added: {len(all_texts) - local_before}\")\n",
+ "\n",
+ "    if not all_texts:\n",
+ "        raise ValueError(\"No training data loaded from HC3 or local dataset.\")\n",
+ "\n",
+ "    ds = Dataset.from_dict({\"text\": all_texts, \"label\": all_labels})\n",
+ "    ds = ds.filter(lambda x: x[\"text\"] is not None and len(normalize_text(x[\"text\"])) > 0)\n",
+ "    ds = ds.shuffle(seed=cfg.seed)\n",
+ "    if len(ds) > max_samples:\n",
+ "        ds = ds.select(range(max_samples))\n",
+ "\n",
+ "    split = ds.train_test_split(test_size=0.1, seed=cfg.seed)\n",
+ "    return split\n",
+ "\n",
+ "\n",
+ "def add_short_text_variants(dataset: Dataset, short_word_limit: int = 50, ratio: float = 0.35) -> Dataset:\n",
+ "    \"\"\"Append short samples: duplicates of short texts and truncated long texts.\"\"\"\n",
+ "    short_texts: list[str] = []\n",
+ "    short_labels: list[int] = []\n",
+ "\n",
+ "    for row in dataset:\n",
+ "        text = normalize_text(row[\"text\"])\n",
+ "        label = int(row[\"label\"])\n",
+ "        words = text.split()\n",
+ "\n",
+ "        if len(words) <= short_word_limit:\n",
+ "            if random.random() < ratio:\n",
+ "                short_texts.append(text)\n",
+ "                short_labels.append(label)\n",
+ "            continue\n",
+ "\n",
+ "        # Keep first N words as a short variant to train behavior on short inputs.\n",
+ "        if random.random() < ratio:\n",
+ "            short_text = \" \".join(words[:short_word_limit])\n",
+ "            short_texts.append(short_text)\n",
+ "            short_labels.append(label)\n",
+ "\n",
+ "    if not short_texts:\n",
+ "        return dataset\n",
+ "\n",
+ "    aug = Dataset.from_dict({\"text\": short_texts, \"label\": short_labels})\n",
+ "    return concatenate_datasets([dataset, aug]).shuffle(seed=cfg.seed)"
287
+ ]
288
+ },
289
+ {
290
+ "cell_type": "code",
291
+ "execution_count": 5,
292
+ "id": "889c5e58",
293
+ "metadata": {},
294
+ "outputs": [
295
+ {
296
+ "name": "stdout",
297
+ "output_type": "stream",
298
+ "text": [
299
+ "HC3 unavailable: Dataset scripts are no longer supported, but found HC3.py\n",
300
+ "Skipped file due to parse error: DATASET/test.csv (No columns to parse from file)\n",
301
+ "Local recursive files scanned: 2\n",
302
+ "Local samples added: 19940\n"
303
+ ]
304
+ },
305
+ {
306
+ "name": "stderr",
307
+ "output_type": "stream",
308
+ "text": [
309
+ "Filter: 100%|██████████| 19940/19940 [00:00<00:00, 133317.22 examples/s]\n"
310
+ ]
311
+ },
312
+ {
313
+ "name": "stdout",
314
+ "output_type": "stream",
315
+ "text": [
316
+ "Train samples: 24213\n",
317
+ "Eval samples: 1994\n",
318
+ "Train short (<50 words): 6839\n",
319
+ "Eval short (<50 words): 569\n"
320
+ ]
321
+ }
322
+ ],
323
+ "source": [
+ "raw_data = get_combined_dataset(max_samples=cfg.max_samples)\n",
+ "train_data = add_short_text_variants(\n",
+ "    raw_data[\"train\"],\n",
+ "    short_word_limit=cfg.short_word_limit,\n",
+ "    ratio=cfg.short_aug_ratio,\n",
+ ")\n",
+ "eval_data = raw_data[\"test\"]\n",
+ "\n",
+ "# Count short samples against the configured limit instead of a hard-coded 50.\n",
+ "short_train = sum(count_words(t) < cfg.short_word_limit for t in train_data[\"text\"])\n",
+ "short_eval = sum(count_words(t) < cfg.short_word_limit for t in eval_data[\"text\"])\n",
+ "\n",
+ "print(f\"Train samples: {len(train_data)}\")\n",
+ "print(f\"Eval samples: {len(eval_data)}\")\n",
+ "print(f\"Train short (<{cfg.short_word_limit} words): {short_train}\")\n",
+ "print(f\"Eval short (<{cfg.short_word_limit} words): {short_eval}\")"
339
+ ]
340
+ },
341
+ {
342
+ "cell_type": "code",
343
+ "execution_count": 7,
344
+ "id": "e8a2ff3e",
345
+ "metadata": {},
346
+ "outputs": [
347
+ {
348
+ "name": "stderr",
349
+ "output_type": "stream",
350
+ "text": [
351
+ "Loading weights: 100%|██████████| 101/101 [00:00<00:00, 8921.80it/s]\n",
352
+ "\u001b[1mRobertaForSequenceClassification LOAD REPORT\u001b[0m from: distilroberta-base\n",
353
+ "Key | Status | \n",
354
+ "----------------------------+------------+-\n",
355
+ "roberta.pooler.dense.weight | UNEXPECTED | \n",
356
+ "lm_head.dense.weight | UNEXPECTED | \n",
357
+ "roberta.pooler.dense.bias | UNEXPECTED | \n",
358
+ "lm_head.layer_norm.bias | UNEXPECTED | \n",
359
+ "lm_head.dense.bias | UNEXPECTED | \n",
360
+ "lm_head.layer_norm.weight | UNEXPECTED | \n",
361
+ "lm_head.bias | UNEXPECTED | \n",
362
+ "classifier.out_proj.bias | MISSING | \n",
363
+ "classifier.dense.weight | MISSING | \n",
364
+ "classifier.dense.bias | MISSING | \n",
365
+ "classifier.out_proj.weight | MISSING | \n",
366
+ "\n",
367
+ "\u001b[3mNotes:\n",
368
+ "- UNEXPECTED\u001b[3m\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\n",
369
+ "- MISSING\u001b[3m\t:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.\u001b[0m\n",
370
+ "Map: 100%|██████████| 24213/24213 [00:01<00:00, 12285.23 examples/s]\n",
371
+ "Map: 100%|██████████| 1994/1994 [00:00<00:00, 11737.65 examples/s]\n"
372
+ ]
373
+ }
374
+ ],
375
+ "source": [
+ "tokenizer = AutoTokenizer.from_pretrained(cfg.base_model_name)\n",
+ "model = AutoModelForSequenceClassification.from_pretrained(cfg.base_model_name, num_labels=2).to(DEVICE)\n",
+ "\n",
+ "\n",
+ "def preprocess_batch(batch: dict, tokenizer: PreTrainedTokenizer, max_length: int = 256) -> BatchEncoding:\n",
+ "    encoded = tokenizer(\n",
+ "        batch[\"text\"],\n",
+ "        truncation=True,\n",
+ "        max_length=max_length,\n",
+ "    )\n",
+ "    encoded[\"labels\"] = batch[\"label\"]\n",
+ "    return encoded\n",
+ "\n",
+ "\n",
+ "tokenize_fn = partial(preprocess_batch, tokenizer=tokenizer, max_length=cfg.max_length)\n",
+ "tokenized_train = train_data.map(tokenize_fn, batched=True, remove_columns=[\"text\", \"label\"])\n",
+ "tokenized_eval = eval_data.map(tokenize_fn, batched=True, remove_columns=[\"text\", \"label\"])\n",
+ "\n",
+ "columns = tokenized_train.column_names\n",
+ "tensor_columns = [name for name in [\"input_ids\", \"attention_mask\", \"token_type_ids\", \"labels\"] if name in columns]\n",
+ "tokenized_train.set_format(type=\"torch\", columns=tensor_columns)\n",
+ "tokenized_eval.set_format(type=\"torch\", columns=tensor_columns)\n",
+ "\n",
+ "metric_accuracy = evaluate.load(\"accuracy\")\n",
+ "metric_f1 = evaluate.load(\"f1\")\n",
+ "\n",
+ "\n",
+ "def compute_metrics(eval_pred: tuple[np.ndarray, np.ndarray]) -> dict[str, float]:\n",
+ "    logits, labels = eval_pred\n",
+ "    if isinstance(logits, tuple):\n",
+ "        logits = logits[0]\n",
+ "    preds = np.argmax(logits, axis=1)\n",
+ "    acc = metric_accuracy.compute(predictions=preds, references=labels)\n",
+ "    f1 = metric_f1.compute(predictions=preds, references=labels)\n",
+ "    return {\"accuracy\": float(acc[\"accuracy\"]), \"f1\": float(f1[\"f1\"])}"
411
+ ]
412
+ },
413
+ {
414
+ "cell_type": "code",
415
+ "execution_count": null,
416
+ "id": "00f52ac8",
417
+ "metadata": {},
418
+ "outputs": [
419
+ {
420
+ "name": "stdout",
421
+ "output_type": "stream",
422
+ "text": [
423
+ "Start training V2 model...\n"
424
+ ]
425
+ },
426
+ {
427
+ "data": {
428
+ "text/html": [
429
+ "\n",
430
+ " <div>\n",
431
+ " \n",
432
+ " <progress value='4542' max='4542' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
433
+ " [4542/4542 20:20, Epoch 3/3]\n",
434
+ " </div>\n",
435
+ " <table border=\"1\" class=\"dataframe\">\n",
436
+ " <thead>\n",
437
+ " <tr style=\"text-align: left;\">\n",
438
+ " <th>Step</th>\n",
439
+ " <th>Training Loss</th>\n",
440
+ " <th>Validation Loss</th>\n",
441
+ " <th>Accuracy</th>\n",
442
+ " <th>F1</th>\n",
443
+ " </tr>\n",
444
+ " </thead>\n",
445
+ " <tbody>\n",
446
+ " <tr>\n",
447
+ " <td>200</td>\n",
448
+ " <td>0.666410</td>\n",
449
+ " <td>0.350684</td>\n",
450
+ " <td>0.834504</td>\n",
451
+ " <td>0.855390</td>\n",
452
+ " </tr>\n",
453
+ " <tr>\n",
454
+ " <td>400</td>\n",
455
+ " <td>0.598755</td>\n",
456
+ " <td>0.256876</td>\n",
457
+ " <td>0.897192</td>\n",
458
+ " <td>0.904518</td>\n",
459
+ " </tr>\n",
460
+ " <tr>\n",
461
+ " <td>600</td>\n",
462
+ " <td>0.574993</td>\n",
463
+ " <td>0.198666</td>\n",
464
+ " <td>0.919258</td>\n",
465
+ " <td>0.917138</td>\n",
466
+ " </tr>\n",
467
+ " <tr>\n",
468
+ " <td>800</td>\n",
469
+ " <td>0.560090</td>\n",
470
+ " <td>0.555182</td>\n",
471
+ " <td>0.849047</td>\n",
472
+ " <td>0.868040</td>\n",
473
+ " </tr>\n",
474
+ " <tr>\n",
475
+ " <td>1000</td>\n",
476
+ " <td>0.387553</td>\n",
477
+ " <td>0.203730</td>\n",
478
+ " <td>0.929288</td>\n",
479
+ " <td>0.930848</td>\n",
480
+ " </tr>\n",
481
+ " <tr>\n",
482
+ " <td>1200</td>\n",
483
+ " <td>0.411762</td>\n",
484
+ " <td>0.521041</td>\n",
485
+ " <td>0.849047</td>\n",
486
+ " <td>0.868387</td>\n",
487
+ " </tr>\n",
488
+ " <tr>\n",
489
+ " <td>1400</td>\n",
490
+ " <td>0.386610</td>\n",
491
+ " <td>0.348940</td>\n",
492
+ " <td>0.902708</td>\n",
493
+ " <td>0.910434</td>\n",
494
+ " </tr>\n",
495
+ " <tr>\n",
496
+ " <td>1600</td>\n",
497
+ " <td>0.244696</td>\n",
498
+ " <td>0.346382</td>\n",
499
+ " <td>0.916249</td>\n",
500
+ " <td>0.921633</td>\n",
501
+ " </tr>\n",
502
+ " <tr>\n",
503
+ " <td>1800</td>\n",
504
+ " <td>0.223823</td>\n",
505
+ " <td>0.308763</td>\n",
506
+ " <td>0.924774</td>\n",
507
+ " <td>0.928977</td>\n",
508
+ " </tr>\n",
509
+ " <tr>\n",
510
+ " <td>2000</td>\n",
511
+ " <td>0.249242</td>\n",
512
+ " <td>0.358467</td>\n",
513
+ " <td>0.919258</td>\n",
514
+ " <td>0.924307</td>\n",
515
+ " </tr>\n",
516
+ " <tr>\n",
517
+ " <td>2200</td>\n",
518
+ " <td>0.221226</td>\n",
519
+ " <td>0.335397</td>\n",
520
+ " <td>0.919759</td>\n",
521
+ " <td>0.924599</td>\n",
522
+ " </tr>\n",
523
+ " <tr>\n",
524
+ " <td>2400</td>\n",
525
+ " <td>0.221417</td>\n",
526
+ " <td>0.587722</td>\n",
527
+ " <td>0.882648</td>\n",
528
+ " <td>0.894973</td>\n",
529
+ " </tr>\n",
530
+ " <tr>\n",
531
+ " <td>2600</td>\n",
532
+ " <td>0.191291</td>\n",
533
+ " <td>0.329566</td>\n",
534
+ " <td>0.928285</td>\n",
535
+ " <td>0.931677</td>\n",
536
+ " </tr>\n",
537
+ " <tr>\n",
538
+ " <td>2800</td>\n",
539
+ " <td>0.219115</td>\n",
540
+ " <td>0.368331</td>\n",
541
+ " <td>0.919759</td>\n",
542
+ " <td>0.925164</td>\n",
543
+ " </tr>\n",
544
+ " <tr>\n",
545
+ " <td>3000</td>\n",
546
+ " <td>0.308968</td>\n",
547
+ " <td>0.277328</td>\n",
548
+ " <td>0.931795</td>\n",
549
+ " <td>0.934928</td>\n",
550
+ " </tr>\n",
551
+ " <tr>\n",
552
+ " <td>3200</td>\n",
553
+ " <td>0.131352</td>\n",
554
+ " <td>0.585112</td>\n",
555
+ " <td>0.891174</td>\n",
556
+ " <td>0.901854</td>\n",
557
+ " </tr>\n",
558
+ " <tr>\n",
559
+ " <td>3400</td>\n",
560
+ " <td>0.152614</td>\n",
561
+ " <td>0.388915</td>\n",
562
+ " <td>0.924273</td>\n",
563
+ " <td>0.929208</td>\n",
564
+ " </tr>\n",
565
+ " <tr>\n",
566
+ " <td>3600</td>\n",
567
+ " <td>0.145248</td>\n",
568
+ " <td>0.439313</td>\n",
569
+ " <td>0.921765</td>\n",
570
+ " <td>0.926898</td>\n",
571
+ " </tr>\n",
572
+ " <tr>\n",
573
+ " <td>3800</td>\n",
574
+ " <td>0.086042</td>\n",
575
+ " <td>0.467167</td>\n",
576
+ " <td>0.920762</td>\n",
577
+ " <td>0.926099</td>\n",
578
+ " </tr>\n",
579
+ " <tr>\n",
580
+ " <td>4000</td>\n",
581
+ " <td>0.051121</td>\n",
582
+ " <td>0.561893</td>\n",
583
+ " <td>0.909729</td>\n",
584
+ " <td>0.916898</td>\n",
585
+ " </tr>\n",
586
+ " <tr>\n",
587
+ " <td>4200</td>\n",
588
+ " <td>0.141769</td>\n",
589
+ " <td>0.477382</td>\n",
590
+ " <td>0.920762</td>\n",
591
+ " <td>0.926168</td>\n",
592
+ " </tr>\n",
593
+ " <tr>\n",
594
+ " <td>4400</td>\n",
595
+ " <td>0.016825</td>\n",
596
+ " <td>0.506922</td>\n",
597
+ " <td>0.918255</td>\n",
598
+ " <td>0.924151</td>\n",
599
+ " </tr>\n",
600
+ " </tbody>\n",
601
+ "</table><p>"
602
+ ],
603
+ "text/plain": [
604
+ "<IPython.core.display.HTML object>"
605
+ ]
606
+ },
607
+ "metadata": {},
608
+ "output_type": "display_data"
609
+ },
610
+ {
611
+ "name": "stderr",
612
+ "output_type": "stream",
613
+ "text": [
614
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.48it/s]\n",
615
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.84it/s]\n",
616
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.64it/s]\n",
617
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.02it/s]\n",
618
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.96it/s]\n",
619
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.07it/s]\n",
620
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.79it/s]\n",
621
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.02it/s]\n",
622
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.03it/s]\n",
623
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.03it/s]\n",
624
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.00it/s]\n",
625
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.10it/s]\n",
626
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.59it/s]\n",
627
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.23it/s]\n",
628
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.16it/s]\n",
629
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.19it/s]\n",
630
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.14it/s]\n",
631
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.14it/s]\n",
632
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.00it/s]\n",
633
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.21it/s]\n",
634
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 5.17it/s]\n",
635
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.99it/s]\n",
636
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.22it/s]\n",
637
+ "There were missing keys in the checkpoint model loaded: ['roberta.embeddings.LayerNorm.weight', 'roberta.embeddings.LayerNorm.bias', 'roberta.encoder.layer.0.attention.output.LayerNorm.weight', 'roberta.encoder.layer.0.attention.output.LayerNorm.bias', 'roberta.encoder.layer.0.output.LayerNorm.weight', 'roberta.encoder.layer.0.output.LayerNorm.bias', 'roberta.encoder.layer.1.attention.output.LayerNorm.weight', 'roberta.encoder.layer.1.attention.output.LayerNorm.bias', 'roberta.encoder.layer.1.output.LayerNorm.weight', 'roberta.encoder.layer.1.output.LayerNorm.bias', 'roberta.encoder.layer.2.attention.output.LayerNorm.weight', 'roberta.encoder.layer.2.attention.output.LayerNorm.bias', 'roberta.encoder.layer.2.output.LayerNorm.weight', 'roberta.encoder.layer.2.output.LayerNorm.bias', 'roberta.encoder.layer.3.attention.output.LayerNorm.weight', 'roberta.encoder.layer.3.attention.output.LayerNorm.bias', 'roberta.encoder.layer.3.output.LayerNorm.weight', 'roberta.encoder.layer.3.output.LayerNorm.bias', 'roberta.encoder.layer.4.attention.output.LayerNorm.weight', 'roberta.encoder.layer.4.attention.output.LayerNorm.bias', 'roberta.encoder.layer.4.output.LayerNorm.weight', 'roberta.encoder.layer.4.output.LayerNorm.bias', 'roberta.encoder.layer.5.attention.output.LayerNorm.weight', 'roberta.encoder.layer.5.attention.output.LayerNorm.bias', 'roberta.encoder.layer.5.output.LayerNorm.weight', 'roberta.encoder.layer.5.output.LayerNorm.bias'].\n",
638
+ "There were unexpected keys in the checkpoint model loaded: ['roberta.embeddings.LayerNorm.beta', 'roberta.embeddings.LayerNorm.gamma', 'roberta.encoder.layer.0.attention.output.LayerNorm.beta', 'roberta.encoder.layer.0.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.0.output.LayerNorm.beta', 'roberta.encoder.layer.0.output.LayerNorm.gamma', 'roberta.encoder.layer.1.attention.output.LayerNorm.beta', 'roberta.encoder.layer.1.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.1.output.LayerNorm.beta', 'roberta.encoder.layer.1.output.LayerNorm.gamma', 'roberta.encoder.layer.2.attention.output.LayerNorm.beta', 'roberta.encoder.layer.2.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.2.output.LayerNorm.beta', 'roberta.encoder.layer.2.output.LayerNorm.gamma', 'roberta.encoder.layer.3.attention.output.LayerNorm.beta', 'roberta.encoder.layer.3.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.3.output.LayerNorm.beta', 'roberta.encoder.layer.3.output.LayerNorm.gamma', 'roberta.encoder.layer.4.attention.output.LayerNorm.beta', 'roberta.encoder.layer.4.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.4.output.LayerNorm.beta', 'roberta.encoder.layer.4.output.LayerNorm.gamma', 'roberta.encoder.layer.5.attention.output.LayerNorm.beta', 'roberta.encoder.layer.5.attention.output.LayerNorm.gamma', 'roberta.encoder.layer.5.output.LayerNorm.beta', 'roberta.encoder.layer.5.output.LayerNorm.gamma'].\n"
639
+ ]
640
+ },
641
+ {
642
+ "name": "stdout",
643
+ "output_type": "stream",
644
+ "text": [
645
+ "Final evaluation...\n"
646
+ ]
647
+ },
648
+ {
649
+ "data": {
650
+ "text/html": [
651
+ "\n",
652
+ " <div>\n",
653
+ " \n",
654
+ " <progress value='250' max='250' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
655
+ " [250/250 00:07]\n",
656
+ " </div>\n",
657
+ " "
658
+ ],
659
+ "text/plain": [
660
+ "<IPython.core.display.HTML object>"
661
+ ]
662
+ },
663
+ "metadata": {},
664
+ "output_type": "display_data"
665
+ },
666
+ {
667
+ "ename": "RuntimeError",
668
+ "evalue": "on_train_begin must be called before on_evaluate",
669
+ "output_type": "error",
670
+ "traceback": [
671
+ "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
672
+ "\u001b[31mRuntimeError\u001b[39m Traceback (most recent call last)",
673
+ "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[8]\u001b[39m\u001b[32m, line 55\u001b[39m\n\u001b[32m 52\u001b[39m trainer.train()\n\u001b[32m 54\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33mFinal evaluation...\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m---> \u001b[39m\u001b[32m55\u001b[39m eval_result = \u001b[43mtrainer\u001b[49m\u001b[43m.\u001b[49m\u001b[43mevaluate\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 56\u001b[39m \u001b[38;5;28mprint\u001b[39m(json.dumps(eval_result, indent=\u001b[32m2\u001b[39m, default=\u001b[38;5;28mstr\u001b[39m))\n",
674
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/trainer.py:2602\u001b[39m, in \u001b[36mTrainer.evaluate\u001b[39m\u001b[34m(self, eval_dataset, ignore_keys, metric_key_prefix)\u001b[39m\n\u001b[32m 2599\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m DebugOption.TPU_METRICS_DEBUG \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.args.debug:\n\u001b[32m 2600\u001b[39m xm.master_print(met.metrics_report())\n\u001b[32m-> \u001b[39m\u001b[32m2602\u001b[39m \u001b[38;5;28mself\u001b[39m.control = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcallback_handler\u001b[49m\u001b[43m.\u001b[49m\u001b[43mon_evaluate\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mstate\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcontrol\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43moutput\u001b[49m\u001b[43m.\u001b[49m\u001b[43mmetrics\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 2604\u001b[39m \u001b[38;5;28mself\u001b[39m._memory_tracker.stop_and_update_metrics(output.metrics)\n\u001b[32m 2606\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m output.metrics\n",
675
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/trainer_callback.py:524\u001b[39m, in \u001b[36mCallbackHandler.on_evaluate\u001b[39m\u001b[34m(self, args, state, control, metrics)\u001b[39m\n\u001b[32m 522\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mon_evaluate\u001b[39m(\u001b[38;5;28mself\u001b[39m, args: TrainingArguments, state: TrainerState, control: TrainerControl, metrics):\n\u001b[32m 523\u001b[39m control.should_evaluate = \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m524\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcall_event\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mon_evaluate\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstate\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcontrol\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmetrics\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmetrics\u001b[49m\u001b[43m)\u001b[49m\n",
676
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/trainer_callback.py:545\u001b[39m, in \u001b[36mCallbackHandler.call_event\u001b[39m\u001b[34m(self, event, args, state, control, **kwargs)\u001b[39m\n\u001b[32m 543\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mcall_event\u001b[39m(\u001b[38;5;28mself\u001b[39m, event, args, state, control, **kwargs):\n\u001b[32m 544\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m callback \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.callbacks:\n\u001b[32m--> \u001b[39m\u001b[32m545\u001b[39m result = \u001b[38;5;28;43mgetattr\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mcallback\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mevent\u001b[49m\u001b[43m)\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 546\u001b[39m \u001b[43m \u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 547\u001b[39m \u001b[43m \u001b[49m\u001b[43mstate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 548\u001b[39m \u001b[43m \u001b[49m\u001b[43mcontrol\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 549\u001b[39m \u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 550\u001b[39m \u001b[43m \u001b[49m\u001b[43mprocessing_class\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mprocessing_class\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 551\u001b[39m \u001b[43m \u001b[49m\u001b[43moptimizer\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43moptimizer\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 552\u001b[39m \u001b[43m \u001b[49m\u001b[43mlr_scheduler\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mlr_scheduler\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 553\u001b[39m \u001b[43m 
\u001b[49m\u001b[43mtrain_dataloader\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mtrain_dataloader\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 554\u001b[39m \u001b[43m \u001b[49m\u001b[43meval_dataloader\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43meval_dataloader\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 555\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 556\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 557\u001b[39m \u001b[38;5;66;03m# A Callback can skip the return of `control` if it doesn't change it.\u001b[39;00m\n\u001b[32m 558\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m result \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n",
677
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/notebook.py:354\u001b[39m, in \u001b[36mNotebookProgressCallback.on_evaluate\u001b[39m\u001b[34m(self, args, state, control, metrics, **kwargs)\u001b[39m\n\u001b[32m 353\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mon_evaluate\u001b[39m(\u001b[38;5;28mself\u001b[39m, args, state, control, metrics=\u001b[38;5;28;01mNone\u001b[39;00m, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m354\u001b[39m tt = \u001b[43m_require\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mtraining_tracker\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mon_train_begin must be called before on_evaluate\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[32m 356\u001b[39m values = {\u001b[33m\"\u001b[39m\u001b[33mTraining Loss\u001b[39m\u001b[33m\"\u001b[39m: \u001b[33m\"\u001b[39m\u001b[33mNo log\u001b[39m\u001b[33m\"\u001b[39m, \u001b[33m\"\u001b[39m\u001b[33mValidation Loss\u001b[39m\u001b[33m\"\u001b[39m: \u001b[33m\"\u001b[39m\u001b[33mNo log\u001b[39m\u001b[33m\"\u001b[39m}\n\u001b[32m 357\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m log \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mreversed\u001b[39m(state.log_history):\n",
678
+ "\u001b[36mFile \u001b[39m\u001b[32m~/miniconda3/envs/ml/lib/python3.11/site-packages/transformers/utils/notebook.py:31\u001b[39m, in \u001b[36m_require\u001b[39m\u001b[34m(x, msg)\u001b[39m\n\u001b[32m 29\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m_require\u001b[39m(x: _T | \u001b[38;5;28;01mNone\u001b[39;00m, msg: \u001b[38;5;28mstr\u001b[39m) -> _T:\n\u001b[32m 30\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m x \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m---> \u001b[39m\u001b[32m31\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(msg)\n\u001b[32m 32\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m x\n",
679
+ "\u001b[31mRuntimeError\u001b[39m: on_train_begin must be called before on_evaluate"
680
+ ]
681
+ }
682
+ ],
683
+ "source": [
684
+ "import sys\n",
685
+ "import subprocess\n",
686
+ "\n",
687
+ "\n",
688
+ "def _ensure_accelerate(min_version: str = \"1.1.0\") -> None:\n",
689
+ " try:\n",
690
+ " import accelerate # noqa: F401\n",
691
+ " from packaging import version\n",
692
+ "\n",
693
+ " if version.parse(accelerate.__version__) < version.parse(min_version):\n",
694
+ " raise ImportError(f\"accelerate version too old: {accelerate.__version__}\")\n",
695
+ " except Exception:\n",
696
+ " print(\"Installing/upgrading accelerate in current kernel environment...\")\n",
697
+ " subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", f\"accelerate>={min_version}\"])\n",
698
+ "\n",
699
+ "\n",
700
+ "_ensure_accelerate()\n",
701
+ "\n",
702
+ "train_args = TrainingArguments(\n",
703
+ " output_dir=\"./results/v2-distilroberta\",\n",
704
+ " num_train_epochs=3,\n",
705
+ " learning_rate=2e-5,\n",
706
+ " weight_decay=0.01,\n",
707
+ " per_device_train_batch_size=8,\n",
708
+ " per_device_eval_batch_size=8,\n",
709
+ " gradient_accumulation_steps=2,\n",
710
+ " fp16=torch.cuda.is_available(),\n",
711
+ " eval_strategy=\"steps\",\n",
712
+ " eval_steps=200,\n",
713
+ " save_strategy=\"steps\",\n",
714
+ " save_steps=200,\n",
715
+ " save_total_limit=2,\n",
716
+ " logging_steps=50,\n",
717
+ " metric_for_best_model=\"eval_f1\",\n",
718
+ " load_best_model_at_end=True,\n",
719
+ " remove_unused_columns=False,\n",
720
+ " report_to=\"none\",\n",
721
+ ")\n",
722
+ "\n",
723
+ "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)\n",
724
+ "\n",
725
+ "trainer = Trainer(\n",
726
+ " model=model,\n",
727
+ " args=train_args,\n",
728
+ " train_dataset=tokenized_train,\n",
729
+ " eval_dataset=tokenized_eval,\n",
730
+ " data_collator=data_collator,\n",
731
+ " compute_metrics=compute_metrics,\n",
732
+ ")\n",
733
+ "\n",
734
+ "print(\"Start training V2 model...\")\n",
735
+ "train_result = trainer.train()\n",
736
+ "\n",
737
+ "print(\"\\n✓ Training complete!\")\n",
738
+ "print(\"Final training metrics:\")\n",
739
+ "if hasattr(trainer.state, 'log_history') and trainer.state.log_history:\n",
740
+ " # Get the last evaluation metrics from log history\n",
741
+ " for log_entry in reversed(trainer.state.log_history):\n",
742
+ " if 'eval_loss' in log_entry:\n",
743
+ " print(f\" Eval Loss: {log_entry.get('eval_loss', float('nan')):.4f}\")\n",
743
+ " print(f\" Eval Accuracy: {log_entry.get('eval_accuracy', float('nan')):.4f}\")\n",
744
+ " print(f\" Eval F1: {log_entry.get('eval_f1', float('nan')):.4f}\")\n",
746
+ " break"
747
+ ]
748
+ },
749
+ {
750
+ "cell_type": "code",
751
+ "execution_count": 9,
752
+ "id": "1b601515",
753
+ "metadata": {},
754
+ "outputs": [
755
+ {
756
+ "name": "stderr",
757
+ "output_type": "stream",
758
+ "text": [
759
+ "Writing model shards: 100%|██████████| 1/1 [00:00<00:00, 4.29it/s]"
760
+ ]
761
+ },
762
+ {
763
+ "name": "stdout",
764
+ "output_type": "stream",
765
+ "text": [
766
+ "Saved V2 model to: /mnt/linux-data/Work/aiapi/notebook/ai_vs_human/v2_model\n"
767
+ ]
768
+ },
769
+ {
770
+ "name": "stderr",
771
+ "output_type": "stream",
772
+ "text": [
773
+ "\n"
774
+ ]
775
+ }
776
+ ],
777
+ "source": [
778
+ "save_dir = Path(cfg.output_dir)\n",
779
+ "save_dir.mkdir(parents=True, exist_ok=True)\n",
780
+ "trainer.save_model(str(save_dir))\n",
781
+ "tokenizer.save_pretrained(str(save_dir))\n",
782
+ "\n",
783
+ "label_map = {\"0\": \"human\", \"1\": \"ai\"}\n",
784
+ "(save_dir / \"label_map.json\").write_text(json.dumps(label_map, indent=2), encoding=\"utf-8\")\n",
785
+ "\n",
786
+ "print(f\"Saved V2 model to: {save_dir.resolve()}\")"
787
+ ]
788
+ },
789
+ {
790
+ "cell_type": "code",
791
+ "execution_count": 11,
792
+ "id": "93f0e5a0",
793
+ "metadata": {},
794
+ "outputs": [
795
+ {
796
+ "name": "stdout",
797
+ "output_type": "stream",
798
+ "text": [
799
+ "================================================================================\n",
800
+ "COMPREHENSIVE TEST: All Sentence Types\n",
801
+ "================================================================================\n",
802
+ "\n",
803
+ "1. VERY SHORT SENTENCES (< 10 words):\n",
804
+ " [2 words] human: Hello world.\n",
805
+ " [3 words] human: AI is powerful.\n",
806
+ " [3 words] human: I like coding.\n",
807
+ " [4 words] human: Machine learning works well.\n",
808
+ "\n",
809
+ "2. SHORT SENTENCES (10-50 words):\n",
810
+ " [10 words] human: AI writes fast, but humans add personal experience and emoti...\n",
811
+ " [14 words] human: I woke up late, missed the bus, and ran all the way to class...\n",
812
+ " [11 words] human: This response was generated by a language model in one pass....\n",
813
+ " [17 words] human: The field of data science combines statistics, programming, ...\n",
814
+ "\n",
815
+ "3. MEDIUM SENTENCES (50-150 words):\n",
816
+ " [74 words] human: Artificial intelligence systems can process massive amounts ...\n",
817
+ " [87 words] human: I once tried to learn guitar in a single weekend because I t...\n",
818
+ "\n",
819
+ "4. LONG SENTENCES (150+ words):\n",
820
+ " [153 words] ai: Machine learning represents a subset of artificial intellige...\n",
821
+ "\n",
822
+ "5. EDGE CASES:\n",
823
+ " [1 words] human: 'A'\n",
824
+ " [4 words] human: 'This is a test.'\n",
825
+ " [4 words] human: 'Multiple spaces between words'\n",
826
+ "\n",
827
+ "================================================================================\n",
828
+ "✓ All sentence types tested successfully!\n",
829
+ "================================================================================\n"
830
+ ]
831
+ }
832
+ ],
833
+ "source": [
834
+ "def predict_v2(text: str) -> dict[str, float | int | str]:\n",
835
+ " \"\"\"Predict whether text is AI or human-written. Works for all sentence lengths.\"\"\"\n",
836
+ " cleaned = normalize_text(text)\n",
837
+ " if not cleaned:\n",
838
+ " raise ValueError(\"Input text is empty.\")\n",
839
+ "\n",
840
+ " inputs = tokenizer(\n",
841
+ " cleaned,\n",
842
+ " truncation=True,\n",
843
+ " max_length=cfg.max_length,\n",
844
+ " return_tensors=\"pt\",\n",
845
+ " ).to(model.device)\n",
846
+ "\n",
847
+ " model.eval()\n",
848
+ " with torch.no_grad():\n",
849
+ " logits = model(**inputs).logits\n",
850
+ " probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]\n",
851
+ "\n",
852
+ " pred = int(np.argmax(probs))\n",
853
+ " wc = count_words(cleaned)\n",
854
+ "\n",
855
+ " return {\n",
856
+ " \"text\": cleaned,\n",
857
+ " \"word_count\": wc,\n",
858
+ " \"predicted_label\": pred,\n",
859
+ " \"predicted_name\": \"ai\" if pred == 1 else \"human\",\n",
860
+ " \"probability_human\": float(probs[0]),\n",
861
+ " \"probability_ai\": float(probs[1]),\n",
862
+ " \"short_text\": wc < 50,\n",
863
+ " }\n",
864
+ "\n",
865
+ "\n",
866
+ "print(\"=\" * 80)\n",
867
+ "print(\"COMPREHENSIVE TEST: All Sentence Types\")\n",
868
+ "print(\"=\" * 80)\n",
869
+ "\n",
870
+ "# Test 1: Very short sentences (under 10 words)\n",
871
+ "print(\"\\n1. VERY SHORT SENTENCES (< 10 words):\")\n",
872
+ "very_short = [\n",
873
+ " \"Hello world.\",\n",
874
+ " \"AI is powerful.\",\n",
875
+ " \"I like coding.\",\n",
876
+ " \"Machine learning works well.\",\n",
877
+ "]\n",
878
+ "for text in very_short:\n",
879
+ " result = predict_v2(text)\n",
880
+ " print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}\")\n",
881
+ "\n",
882
+ "# Test 2: Short sentences (10-50 words)\n",
883
+ "print(\"\\n2. SHORT SENTENCES (10-50 words):\")\n",
884
+ "short_examples = [\n",
885
+ " \"AI writes fast, but humans add personal experience and emotion.\",\n",
886
+ " \"I woke up late, missed the bus, and ran all the way to class.\",\n",
887
+ " \"This response was generated by a language model in one pass.\",\n",
888
+ " \"The field of data science combines statistics, programming, and domain knowledge to extract meaningful insights from data.\",\n",
889
+ "]\n",
890
+ "for text in short_examples:\n",
891
+ " result = predict_v2(text)\n",
892
+ " print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}...\")\n",
893
+ "\n",
894
+ "# Test 3: Medium sentences (50-150 words)\n",
895
+ "print(\"\\n3. MEDIUM SENTENCES (50-150 words):\")\n",
896
+ "medium_examples = [\n",
897
+ " \"Artificial intelligence systems can process massive amounts of data extremely quickly compared to humans. They are designed to analyze large datasets, identify patterns, and extract useful insights within seconds or minutes. Using advanced algorithms and machine learning models, AI systems can examine structured and unstructured data such as text, images, audio, and numerical information. By learning from historical data, these systems can recognize complex relationships between variables and make accurate predictions about future outcomes.\",\n",
898
+ " \"I once tried to learn guitar in a single weekend because I thought it would be easy. Turns out my fingers had other plans. After two hours of awkward chords and random noises, I realized that music requires patience, practice, and a lot more discipline than I originally expected. My friends laughed when they heard me trying to play, but I kept practicing anyway because I genuinely wanted to improve. Eventually, after weeks of consistent effort, I could finally play a simple song from start to finish.\",\n",
899
+ "]\n",
900
+ "for text in medium_examples:\n",
901
+ " result = predict_v2(text)\n",
902
+ " print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}...\")\n",
903
+ "\n",
904
+ "# Test 4: Long sentences (150+ words)\n",
905
+ "print(\"\\n4. LONG SENTENCES (150+ words):\")\n",
906
+ "long_examples = [\n",
907
+ " \"Machine learning represents a subset of artificial intelligence that enables computer systems to automatically learn and improve from experience without being explicitly programmed for every single task. The fundamental idea behind machine learning is to develop algorithms that can receive input data and use statistical analysis to predict an output while updating outputs as new data becomes available. This field has grown exponentially over the past few decades, driven by increases in computational power, the availability of large datasets, and breakthroughs in algorithmic approaches. Modern machine learning systems power everything from recommendation engines on streaming platforms to autonomous vehicles, medical diagnosis tools, and natural language processing applications. The three main categories of machine learning include supervised learning, where models are trained on labeled data; unsupervised learning, where patterns are discovered in unlabeled data; and reinforcement learning, where agents learn to make decisions by receiving rewards or penalties for their actions in an environment.\",\n",
908
+ "]\n",
909
+ "for text in long_examples:\n",
910
+ " result = predict_v2(text)\n",
911
+ " print(f\" [{result['word_count']} words] {result['predicted_name']}: {text[:60]}...\")\n",
912
+ "\n",
913
+ "# Test 5: Edge cases\n",
914
+ "print(\"\\n5. EDGE CASES:\")\n",
915
+ "edge_cases = [\n",
916
+ " \"A\", # Single word\n",
917
+ " \"This is a test.\", # Very basic\n",
918
+ " \" Multiple spaces between words \", # Extra whitespace\n",
919
+ "]\n",
920
+ "for text in edge_cases:\n",
921
+ " try:\n",
922
+ " result = predict_v2(text)\n",
923
+ " print(f\" [{result['word_count']} words] {result['predicted_name']}: '{text.strip()}'\")\n",
924
+ " except Exception as e:\n",
925
+ " print(f\" ERROR: {text.strip()[:30]} - {str(e)}\")\n",
926
+ "\n",
927
+ "print(\"\\n\" + \"=\" * 80)\n",
928
+ "print(\"✓ All sentence types tested successfully!\")\n",
929
+ "print(\"=\" * 80)"
930
+ ]
931
+ },
932
+ {
933
+ "cell_type": "code",
934
+ "execution_count": 12,
935
+ "id": "98ef7c7d",
936
+ "metadata": {},
937
+ "outputs": [
938
+ {
939
+ "name": "stdout",
940
+ "output_type": "stream",
941
+ "text": [
942
+ "================================================================================\n",
943
+ "TESTING SAVED V2 MODEL FROM DISK\n",
944
+ "================================================================================\n"
945
+ ]
946
+ },
947
+ {
948
+ "name": "stderr",
949
+ "output_type": "stream",
950
+ "text": [
951
+ "Loading weights: 100%|██████████| 105/105 [00:00<00:00, 8556.64it/s]"
952
+ ]
953
+ },
954
+ {
955
+ "name": "stdout",
956
+ "output_type": "stream",
957
+ "text": [
958
+ "\n",
959
+ "✓ Loaded model from: v2_model\n",
960
+ "\n",
961
+ "Running inference tests:\n",
962
+ " [very short ] human (AI: 0.50%): Hi there!\n",
963
+ " [short ] human (AI: 0.09%): I love programming and building cool projects.\n",
964
+ " [medium ] human (AI: 3.09%): Artificial intelligence has revolutionized many in\n",
965
+ "\n",
966
+ "✓ Saved model works correctly for all sentence types!\n"
967
+ ]
968
+ },
969
+ {
970
+ "name": "stderr",
971
+ "output_type": "stream",
972
+ "text": [
973
+ "\n"
974
+ ]
975
+ }
976
+ ],
977
+ "source": [
978
+ "# Load and test the saved v2_model independently\n",
979
+ "print(\"=\" * 80)\n",
980
+ "print(\"TESTING SAVED V2 MODEL FROM DISK\")\n",
981
+ "print(\"=\" * 80)\n",
982
+ "\n",
983
+ "saved_model_path = Path(cfg.output_dir)\n",
984
+ "if saved_model_path.exists():\n",
985
+ " # Load fresh model and tokenizer from saved checkpoint\n",
986
+ " saved_tokenizer = AutoTokenizer.from_pretrained(str(saved_model_path))\n",
987
+ " saved_model = AutoModelForSequenceClassification.from_pretrained(str(saved_model_path)).to(DEVICE)\n",
988
+ " \n",
989
+ " print(f\"\\n✓ Loaded model from: {saved_model_path}\")\n",
990
+ " \n",
991
+ " # Test with diverse examples\n",
992
+ " test_cases = [\n",
993
+ " (\"Hi there!\", \"very short\"),\n",
994
+ " (\"I love programming and building cool projects.\", \"short\"),\n",
995
+ " (\"Artificial intelligence has revolutionized many industries by enabling automation, improving decision-making, and creating new opportunities for innovation.\", \"medium\"),\n",
996
+ " ]\n",
997
+ " \n",
998
+ " print(\"\\nRunning inference tests:\")\n",
999
+ " for text, category in test_cases:\n",
1000
+ " inputs = saved_tokenizer(text, truncation=True, max_length=256, return_tensors=\"pt\").to(DEVICE)\n",
1001
+ " saved_model.eval()\n",
1002
+ " with torch.no_grad():\n",
1003
+ " logits = saved_model(**inputs).logits\n",
1004
+ " probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]\n",
1005
+ " pred_label = int(np.argmax(probs))\n",
1006
+ " pred_name = \"ai\" if pred_label == 1 else \"human\"\n",
1007
+ " \n",
1008
+ " print(f\" [{category:12}] {pred_name:6} (AI: {probs[1]:.2%}): {text[:50]}\")\n",
1009
+ " \n",
1010
+ " print(\"\\n✓ Saved model works correctly for all sentence types!\")\n",
1011
+ "else:\n",
1012
+ " print(f\"⚠ Model not found at: {saved_model_path}\")\n",
1013
+ " print(\" Run the save cell first to create v2_model/\")"
1014
+ ]
1015
+ },
1016
+ {
1017
+ "cell_type": "code",
1018
+ "execution_count": 13,
1019
+ "id": "2f63e591",
1020
+ "metadata": {},
1021
+ "outputs": [
1022
+ {
1023
+ "name": "stdout",
1024
+ "output_type": "stream",
1025
+ "text": [
1026
+ "================================================================================\n",
1027
+ "EXTREME EDGE CASE TESTING\n",
1028
+ "================================================================================\n",
1029
+ "\n",
1030
+ "Testing extreme edge cases:\n",
1031
+ " ✓ Single character [ 1w] human (99.3%): 'A'\n",
1032
+ " ✓ Single word [ 1w] human (99.4%): 'Hello'\n",
1033
+ " ✓ Two words [ 2w] human (99.6%): 'Hello world'\n",
1034
+ " ✓ Numbers only [ 3w] human (98.7%): '123 456 789'\n",
1035
+ " ✓ Special chars [ 4w] human (99.8%): '!!! ### $$$ ???'\n",
1036
+ " ✓ Mixed alphanumeric [ 3w] human (99.3%): 'Test123 ABC456 xyz789'\n",
1037
+ " ✓ Very long word [ 1w] human (99.1%): 'supercalifragilisticexpialidocious'\n",
1038
+ " ✓ Repeated words [ 5w] human (99.6%): 'test test test test test'\n",
1039
+ " ✓ Newlines [ 6w] human (99.4%): 'Line one\\nLine two\\nLine three'\n",
1040
+ " ✓ Tabs [ 3w] human (99.5%): 'Col1\\tCol2\\tCol3'\n",
1041
+ " ✓ Multiple spaces [ 3w] human (99.7%): 'Too many spaces'\n",
1042
+ " ✓ Punctuation heavy [ 5w] human (99.8%): 'Wow! Really? Yes! No... Maybe?'\n",
1043
+ " ✗ Empty-like ERROR: Input text is empty.\n",
1044
+ " ✓ Mixed case [ 5w] human (99.3%): 'ThIs Is MiXeD cAsE tExT'\n",
1045
+ " ✓ All caps [ 4w] human (99.3%): 'THIS IS ALL CAPITALS'\n",
1046
+ " ✓ All lower [ 4w] human (99.9%): 'this is all lowercase'\n",
1047
+ "\n",
1048
+ "Result: 15 passed, 1 failed\n",
1049
+ "\n",
1050
+ "================================================================================\n",
1051
+ "BATCH PREDICTION TEST\n",
1052
+ "================================================================================\n",
1053
+ "\n",
1054
+ "Predicting batch of mixed-length sentences:\n",
1055
+ "\n",
1056
+ " Sentence 1 (1 words):\n",
1057
+ " Text: Short....\n",
1058
+ " Prediction: human\n",
1059
+ " Confidence: AI=0.1%, Human=99.9%\n",
1060
+ "\n",
1061
+ " Sentence 2 (9 words):\n",
1062
+ " Text: This is a medium length sentence with some content....\n",
1063
+ " Prediction: human\n",
1064
+ " Confidence: AI=0.1%, Human=99.9%\n",
1065
+ "\n",
1066
+ " Sentence 3 (29 words):\n",
1067
+ " Text: This is a longer sentence that contains more words and provi...\n",
1068
+ " Prediction: human\n",
1069
+ " Confidence: AI=0.1%, Human=99.9%\n",
1070
+ "\n",
1071
+ "================================================================================\n",
1072
+ "✓ ALL EDGE CASES AND BATCH TESTS COMPLETE!\n",
1073
+ "================================================================================\n"
1074
+ ]
1075
+ }
1076
+ ],
1077
+ "source": [
1078
+ "print(\"=\" * 80)\n",
1079
+ "print(\"EXTREME EDGE CASE TESTING\")\n",
1080
+ "print(\"=\" * 80)\n",
1081
+ "\n",
1082
+ "# Test various edge cases that might break the model\n",
1083
+ "edge_test_cases = {\n",
1084
+ " \"Single character\": \"A\",\n",
1085
+ " \"Single word\": \"Hello\",\n",
1086
+ " \"Two words\": \"Hello world\",\n",
1087
+ " \"Numbers only\": \"123 456 789\",\n",
1088
+ " \"Special chars\": \"!!! ### $$$ ???\",\n",
1089
+ " \"Mixed alphanumeric\": \"Test123 ABC456 xyz789\",\n",
1090
+ " \"Very long word\": \"supercalifragilisticexpialidocious\",\n",
1091
+ " \"Repeated words\": \"test test test test test\",\n",
1092
+ " \"Newlines\": \"Line one\\nLine two\\nLine three\",\n",
1093
+ " \"Tabs\": \"Col1\\tCol2\\tCol3\",\n",
1094
+ " \"Multiple spaces\": \"Too many spaces\",\n",
1095
+ " \"Punctuation heavy\": \"Wow! Really? Yes! No... Maybe?\",\n",
1096
+ " \"Empty-like\": \" \",\n",
1097
+ " \"Mixed case\": \"ThIs Is MiXeD cAsE tExT\",\n",
1098
+ " \"All caps\": \"THIS IS ALL CAPITALS\",\n",
1099
+ " \"All lower\": \"this is all lowercase\",\n",
1100
+ "}\n",
1101
+ "\n",
1102
+ "print(\"\\nTesting extreme edge cases:\")\n",
1103
+ "passed = 0\n",
1104
+ "failed = 0\n",
1105
+ "\n",
1106
+ "for case_name, text in edge_test_cases.items():\n",
1107
+ " try:\n",
1108
+ " result = predict_v2(text)\n",
1109
+ " wc = result['word_count']\n",
1110
+ " pred = result['predicted_name']\n",
1111
+ " conf = result['probability_ai'] if pred == 'ai' else result['probability_human']\n",
1112
+ " \n",
1113
+ " # Handle display of text with special characters\n",
1114
+ " display_text = text.replace('\\n', '\\\\n').replace('\\t', '\\\\t')[:40]\n",
1115
+ " print(f\" ✓ {case_name:20} [{wc:2}w] {pred:6} ({conf:.1%}): '{display_text}'\")\n",
1116
+ " passed += 1\n",
1117
+ " except Exception as e:\n",
1118
+ " print(f\" ✗ {case_name:20} ERROR: {str(e)[:50]}\")\n",
1119
+ " failed += 1\n",
1120
+ "\n",
1121
+ "print(f\"\\nResult: {passed} passed, {failed} failed\")\n",
1122
+ "\n",
1123
+ "# Batch prediction test\n",
1124
+ "print(\"\\n\" + \"=\" * 80)\n",
1125
+ "print(\"BATCH PREDICTION TEST\")\n",
1126
+ "print(\"=\" * 80)\n",
1127
+ "\n",
1128
+ "batch_texts = [\n",
1129
+ " \"Short.\",\n",
1130
+ " \"This is a medium length sentence with some content.\",\n",
1131
+ " \"This is a longer sentence that contains more words and provides more context for the model to analyze and make predictions based on the patterns it learned during training.\",\n",
1132
+ "]\n",
1133
+ "\n",
1134
+ "print(\"\\nPredicting batch of mixed-length sentences:\")\n",
1135
+ "batch_results = [predict_v2(text) for text in batch_texts]\n",
1136
+ "\n",
1137
+ "for i, (text, result) in enumerate(zip(batch_texts, batch_results), 1):\n",
1138
+ " print(f\"\\n Sentence {i} ({result['word_count']} words):\")\n",
1139
+ " print(f\" Text: {text[:60]}...\")\n",
1140
+ " print(f\" Prediction: {result['predicted_name']}\")\n",
1141
+ " print(f\" Confidence: AI={result['probability_ai']:.1%}, Human={result['probability_human']:.1%}\")\n",
1142
+ "\n",
1143
+ "print(\"\\n\" + \"=\" * 80)\n",
1144
+ "print(\"✓ ALL EDGE CASES AND BATCH TESTS COMPLETE!\")\n",
1145
+ "print(\"=\" * 80)"
1146
+ ]
1147
+ }
1148
+ ],
1149
+ "metadata": {
1150
+ "kernelspec": {
1151
+ "display_name": "ml",
1152
+ "language": "python",
1153
+ "name": "python3"
1154
+ },
1155
+ "language_info": {
1156
+ "codemirror_mode": {
1157
+ "name": "ipython",
1158
+ "version": 3
1159
+ },
1160
+ "file_extension": ".py",
1161
+ "mimetype": "text/x-python",
1162
+ "name": "python",
1163
+ "nbconvert_exporter": "python",
1164
+ "pygments_lexer": "ipython3",
1165
+ "version": "3.11.14"
1166
+ }
1167
+ },
1168
+ "nbformat": 4,
1169
+ "nbformat_minor": 5
1170
+ }
notebook/ai_vs_human_nepali/notebook/documentation.md ADDED
@@ -0,0 +1,435 @@
1
+ # Nepali AI vs Human Notebook Documentation
2
+
3
+ This folder contains a small notebook series for building an AI-vs-human text detector for Nepali text. The notebooks are not identical copies; they represent the evolution of the project from a lightweight scikit-learn baseline to a stronger hybrid model and a transformer-based experiment.
4
+
5
+ ## Notebook Inventory
6
+
7
+ The notebooks in this directory are:
8
+
9
+ - [main.ipynb](main.ipynb)
10
+ - [working model.ipynb](working%20model.ipynb)
11
+ - [Nepali_Ai_vs_Human.ipynb](Nepali_Ai_vs_Human.ipynb)
12
+ - [final_main.ipynb](final_main.ipynb)
13
+
14
+ ## Shared Goal
15
+
16
+ All notebooks solve the same binary classification task:
17
+
18
+ - Class 0 = Human-written Nepali text
19
+ - Class 1 = AI-generated Nepali text
20
+
21
+ The notebooks differ in how they prepare the data, which features they extract, and which model family they train.
22
+
23
+ ## Shared Data Sources
24
+
25
+ Across the notebooks, the dataset is built from one or more CSV files under the notebook dataset folders. The common column pattern is:
26
+
27
+ - human_text
28
+ - ai_generated_text
29
+
30
+ Some notebooks also use:
31
+
32
+ - title
33
+ - label
34
+ - paragraph
35
+
36
+ The data preparation usually performs some combination of:
37
+
38
+ - dropping null rows
39
+ - stripping whitespace
40
+ - removing duplicates
41
+ - converting two source columns into one text column plus one label column
42
+ - balancing classes by sampling
43
+ - splitting long texts into smaller chunks
44
+
45
+ ## Notebook Relationship
46
+
47
+ The notebooks form a progression:
48
+
49
+ 1. main.ipynb is the first lightweight sklearn baseline.
50
+ 2. working model.ipynb refines the baseline with better text chunking.
51
+ 3. Nepali_Ai_vs_Human.ipynb switches to a transformer-style neural model.
52
+ 4. final_main.ipynb is the most complete hybrid notebook and is the closest thing to a production workflow.
53
+
54
+ ## main.ipynb
55
+
56
+ ### Purpose
57
+
58
+ This is the earliest baseline notebook. It focuses on a CPU-friendly approach using TF-IDF plus hand-crafted text features, then compares several classic machine learning models.
59
+
60
+ ### Data Preparation
61
+
62
+ The notebook loads several CSV files and concatenates them into one dataframe. The data is drawn from:
63
+
64
+ - ../DATASET/data.csv
65
+ - ../DATASET/new_data.csv
66
+ - /mnt/linux-data/Work/aiapi/notebook/ai_vs_human_nepali/news_scrap_new2.fixed.csv
67
+
68
+ The notebook creates separate cleaned columns for human text and AI text, then stacks them into a single training dataframe with labels.
69
+
70
+ Important preprocessing steps:
71
+
72
+ - remove URLs
73
+ - keep only Devanagari (Nepali script) Unicode characters and whitespace
74
+ - lowercase the text
75
+ - remove consecutive repeated words
76
+
77
+ ### Feature Engineering
78
+
79
+ The notebook combines two feature families:
80
+
81
+ - Word-level TF-IDF with 1-2 gram features
82
+ - Dense, hand-crafted features based on text structure
83
+
84
+ The hand-crafted features include:
85
+
86
+ - burstiness statistics from sentence lengths
87
+ - average word length
88
+ - average sentence length
89
+ - lexical diversity
90
+ - punctuation ratio
91
+ - repeated bigram ratio
92
+ - Devanagari diacritic density
93
+
94
+ The sparse TF-IDF matrix is concatenated with the dense feature matrix using horizontal stacking.
95
+
96
+ ### Models Trained
97
+
98
+ The notebook compares several standard classifiers:
99
+
100
+ - LogisticRegressionCV
101
+ - RidgeClassifierCV
102
+ - MultinomialNB
103
+ - BernoulliNB
104
+ - RandomForestClassifier
105
+ - GradientBoostingClassifier
106
+ - LinearSVC
107
+ - KNeighborsClassifier
108
+
109
+ Dense conversion is applied only where the notebook needs it, such as for LinearSVC and KNeighborsClassifier.
110
+
111
+ ### Evaluation
112
+
113
+ The notebook evaluates the models with:
114
+
115
+ - validation accuracy
116
+ - weighted F1 score
117
+ - classification reports
118
+ - confusion matrices
119
+ - ROC curves
120
+
121
+ The top models are selected by validation accuracy and reused in later prediction cells.
122
+
123
+ ### Prediction Demo
124
+
125
+ The notebook includes several sample Nepali texts for inference. It prints per-model predictions and, where possible, confidence values.
126
+
127
+ ### Saved Artifacts
128
+
129
+ Each model is saved as a pickle file in a local saved_models directory.
130
+
131
+ ### Known Issues
132
+
133
+ - Several cells are duplicated, especially the dataset loading cells.
134
+ - The vectorizer and the feature builder are not saved with the models, so a saved model cannot be reloaded for inference without refitting them.
135
+ - There are repeated prediction sections, which makes the notebook harder to maintain.
136
+ - Some cells appear to be placeholders or empty.
137
+
138
+ ## working model.ipynb
139
+
140
+ ### Purpose
141
+
142
+ This notebook is a refinement of main.ipynb. It keeps the same overall classifier strategy but improves how long Nepali articles are handled.
143
+
144
+ ### Main Difference From main.ipynb
145
+
146
+ The key improvement is sentence chunking:
147
+
148
+ - long texts are split into smaller chunks
149
+ - chunk boundaries prefer Nepali danda punctuation
150
+ - each chunk is limited to a small number of sentences or words
151
+
152
+ This makes the dataset more granular and helps the classifier train on smaller, more uniform samples.
153
+
154
+ ### Preprocessing
155
+
156
+ The notebook defines:
157
+
158
+ - clean_text
159
+ - remove_auto_repeating
160
+ - split_into_sentence_chunks
161
+ - expand_texts_to_chunks
162
+
163
+ These functions preserve sentence punctuation for chunking, then normalize the cleaned chunks for downstream training.
164
+
165
+ ### Feature Engineering and Models
166
+
167
+ The rest of the pipeline is essentially the same as main.ipynb:
168
+
169
+ - TF-IDF word n-grams
170
+ - burstiness and stylometric features
171
+ - concatenated sparse + dense representation
172
+ - the same family of sklearn classifiers
173
+
174
+ ### Evaluation and Inference
175
+
176
+ The notebook follows the same model comparison, ROC plotting, confusion matrix plotting, and sample prediction pattern as the baseline notebook.
177
+
178
+ ### Saved Artifacts
179
+
180
+ Like main.ipynb, the fitted sklearn models are stored under saved_models as individual pickle files.
181
+
182
+ ### Known Issues
183
+
184
+ - The notebook has redundant cells and repeated code blocks.
185
+ - It still does not serialize the vectorizer and feature transformer together with the model artifacts.
186
+ - Some prediction logic is repeated more than once.
187
+
188
+ ## Nepali_Ai_vs_Human.ipynb
189
+
190
+ ### Purpose
191
+
192
+ This notebook is the deep learning branch of the project. Instead of hand-crafted features plus classical classifiers, it uses a transformer encoder with a classification head.
193
+
194
+ ### Data Preparation
195
+
196
+ The notebook reads one CSV file and converts the two-column source format into a single text-label dataframe.
197
+
198
+ Important preparation steps:
199
+
200
+ - validate required columns
201
+ - drop nulls
202
+ - build a unified dataframe with text and label
203
+ - filter short texts
204
+ - drop duplicate text rows
205
+ - shuffle the dataset
206
+
207
+ The notebook keeps the raw text mostly intact rather than applying aggressive regex cleaning.
208
+
209
+ ### Model Architecture
210
+
211
+ The model pipeline is built around Hugging Face transformers and PyTorch:
212
+
213
+ - tokenizer from a multilingual BERT-style model
214
+ - AutoModel backbone
215
+ - classification head with dropout
216
+ - binary output layer
217
+
218
+ The notebook defines a custom PyTorch module named IndicBERTClassifier.
219
+
220
+ ### Training Setup
221
+
222
+ The notebook uses:
223
+
224
+ - train/validation split with stratification
225
+ - DataLoader-based batching
226
+ - AdamW optimizer
227
+ - cross-entropy loss
228
+ - linear warmup scheduler
229
+ - gradient accumulation
230
+ - mixed precision when CUDA is available
231
+ - early stopping on validation F1
232
+
233
+ This makes it more GPU-oriented than the sklearn notebooks.
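Two of the items above, the linear warmup schedule and early stopping on validation F1, are simple enough to show without PyTorch. This is a minimal sketch of the usual formulation, not the notebook's code; the class name `EarlyStopping`, the `patience` value, and the F1 history are hypothetical.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


class EarlyStopping:
    """Stop when validation F1 has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best_f1 = float("-inf")
        self.bad_epochs = 0

    def update(self, val_f1):
        """Return (improved, should_stop) for the latest validation F1."""
        if val_f1 > self.best_f1:
            self.best_f1 = val_f1
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# Hypothetical per-epoch validation F1 scores.
stopper = EarlyStopping(patience=2)
history = [stopper.update(f1) for f1 in (0.80, 0.85, 0.84, 0.83)]
```

In a training loop, the `improved` flag is the natural point to write the best checkpoint, and `should_stop` ends the epoch loop.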
234
+
235
+ ### Evaluation
236
+
237
+ Per-epoch evaluation includes:
238
+
239
+ - accuracy
240
+ - F1 score
241
+ - classification report
242
+
243
+ The notebook also saves a checkpoint whenever validation F1 improves.
244
+
245
+ ### Prediction Demo
246
+
247
+ The notebook defines a predict function that:
248
+
249
+ - tokenizes the input text
250
+ - runs the transformer model
251
+ - applies softmax
252
+ - returns the predicted class and confidence
253
+
254
+ Several sample Nepali sentences are passed through the predictor at the end of the notebook.
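The last two steps of the predict function (softmax, then class and confidence) can be illustrated on raw logits without loading a model. The label order in `LABELS` is an assumption; the notebook's class-id mapping may be the other way round.

```python
import math

# Assumed label order; the notebook's actual class-id mapping may differ.
LABELS = ("human", "ai")


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits):
    """Map the two output logits to (label, confidence)."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[class_id], probs[class_id]


# Logits as they might come out of the classification head for one input.
label, confidence = decode_prediction([0.0, 2.0])
```

The confidence here is simply the softmax probability of the winning class, so it should be read as a model score rather than a calibrated probability.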
255
+
256
+ ### Saved Artifacts
257
+
258
+ The notebook saves:
259
+
260
+ - model_best.pth
261
+ - model_latest.pth
262
+ - tokenizer files in ./nepali_xlmr_classifier
263
+
264
+ There is also a Colab-oriented zip export section.
265
+
266
+ ### Known Issues
267
+
268
+ - The notebook mixes local notebook execution with Colab-specific code.
269
+ - Some cells show CUDA or environment-related warnings.
270
+ - The training flow is more complex and less polished than the final hybrid notebook.
271
+ - Paths are hard-coded in a few places.
272
+
273
+ ## final_main.ipynb
274
+
275
+ ### Purpose
276
+
277
+ This is the most complete notebook in the folder. It combines semantic embeddings from Sentence Transformers with stylometric features, then trains a linear model and an XGBoost model on the fused feature vector.
278
+
279
+ ### Data Preparation
280
+
281
+ The notebook references two locations for the dataset, one relative and one machine-specific absolute path:
282
+
283
+ - ../DATASET/Final_data/final_news345.csv
284
+ - /mnt/linux-data/Work/aiapi/notebook/ai_vs_human_nepali/Final_data/final_news345.csv
285
+
286
+ The notebook expects a label column with string values and maps them to binary classes.
287
+
288
+ It also includes a preprocessing utility that can:
289
+
290
+ - split very long Nepali texts into chunks
291
+ - preserve danda-based sentence boundaries
292
+ - filter out extremely short chunks
293
+ - balance the dataset by sampling each class to the same count
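A rough sketch of what such a utility could look like follows. The function names, the character limits, and the choice to balance by downsampling to the smallest class are assumptions; the notebook may chunk by a different unit or sample differently.

```python
import random
import re

# Illustrative limits; the notebook's real thresholds are not documented here.
MAX_CHARS = 300
MIN_CHARS = 15


def split_into_chunks(text, max_chars=MAX_CHARS, min_chars=MIN_CHARS):
    """Group sentences into chunks, cutting only at danda/?/! boundaries."""
    # Each match is one sentence plus its trailing danda (।), '?' or '!'.
    sentences = [s.strip() for s in re.findall(r"[^।?!]+[।?!]*", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    # A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    return [c for c in chunks if len(c) >= min_chars]


def balance_by_downsampling(rows, seed=42):
    """Sample every class down to the size of the smallest class."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row["label"], []).append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = [row for group in by_label.values() for row in rng.sample(group, target)]
    rng.shuffle(balanced)
    return balanced


text = "पहिलो वाक्य यहाँ छ। दोस्रो वाक्य पनि छ। तेस्रो वाक्य अन्त्यमा छ।"
chunks = split_into_chunks(text, max_chars=45, min_chars=5)

rows = [{"text": f"h{i}", "label": 0} for i in range(5)] + \
       [{"text": f"a{i}", "label": 1} for i in range(2)]
balanced = balance_by_downsampling(rows)
```

Downsampling discards data from the larger class; it is used here only because it is the simplest way to reach equal counts.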
294
+
295
+ ### Visualization
296
+
297
+ The notebook includes exploratory plots for:
298
+
299
+ - class distribution
300
+ - character count distribution
301
+ - word count distribution
302
+ - sentence count distribution
303
+ - cleaned text length distribution
304
+ - stylometric feature comparison plots
305
+
306
+ This makes it the most documented and inspection-friendly notebook in the folder.
307
+
308
+ ### Text Cleaning
309
+
310
+ The notebook defines clean_nepali_text, which:
311
+
312
+ - lowercases the text
313
+ - normalizes Nepali and common Unicode punctuation
314
+ - removes unwanted characters
315
+ - collapses repeated whitespace
316
+ - trims the result
317
+
318
+ This cleaned text is used for both embeddings and stylometric extraction.
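The five steps above map onto a short function. The punctuation mapping and the set of characters treated as "wanted" below are assumptions for illustration; the notebook's own clean_nepali_text may keep or drop different characters.

```python
import re

# Assumed punctuation mapping, for illustration only.
PUNCT_MAP = str.maketrans({
    "\u0965": "\u0964",            # double danda -> single danda
    "\u201c": '"', "\u201d": '"',  # curly double quotes -> straight
    "\u2018": "'", "\u2019": "'",  # curly single quotes -> straight
    "\u2013": "-", "\u2014": "-",  # en/em dash -> hyphen
})

# Keep Devanagari (U+0900-U+097F), ZWNJ/ZWJ (they affect conjunct rendering),
# ASCII digits and lowercase letters, whitespace, and basic punctuation.
UNWANTED = re.compile(r"[^\u0900-\u097F\u200c\u200d0-9a-z\s.,!?;:'\"()\-]")


def clean_nepali_text(text):
    """Lowercase, normalize punctuation, drop unwanted characters, tidy whitespace."""
    text = text.lower()                 # only affects Latin letters
    text = text.translate(PUNCT_MAP)    # normalize Nepali and Unicode punctuation
    text = UNWANTED.sub(" ", text)      # remove anything outside the allowed set
    text = re.sub(r"\s+", " ", text)    # collapse repeated whitespace
    return text.strip()                 # trim the result


cleaned = clean_nepali_text("  Hello   \u201cनेपाल\u201d\u0965 \U0001F600 ")
```

Lowercasing runs before the character filter so that the filter only needs to list lowercase Latin letters.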
319
+
320
+ ### Stylometric Features
321
+
322
+ The notebook uses six hand-crafted features:
323
+
324
+ - word_count
325
+ - sentence_count
326
+ - avg_word_length
327
+ - avg_sentence_length
328
+ - type_token_ratio
329
+ - punctuation_ratio
330
+
331
+ These features are extracted from the cleaned text and then standardized with StandardScaler.
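For concreteness, here is one plausible way to compute the six features and standardize them. The exact definitions (whitespace tokenization, which characters count as punctuation) are assumptions, and `standardize` is a plain-Python stand-in for the column-wise z-scoring that StandardScaler performs.

```python
import math
import re

STYLO_COLS = ["word_count", "sentence_count", "avg_word_length",
              "avg_sentence_length", "type_token_ratio", "punctuation_ratio"]
PUNCTUATION = set("।.,!?;:'\"()-")  # assumed punctuation set


def extract_stylometric(text):
    """Compute the six stylometric features for one cleaned text."""
    words = text.split()  # whitespace tokens; punctuation stays attached to words
    sentences = [s for s in re.split(r"[।.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "word_count": n_words,
        "sentence_count": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / max(1, n_words),
        "avg_sentence_length": n_words / max(1, len(sentences)),
        "type_token_ratio": len(set(words)) / max(1, n_words),
        "punctuation_ratio": sum(ch in PUNCTUATION for ch in text) / max(1, len(text)),
    }


def standardize(rows):
    """Column-wise z-scores using the population std; constant columns map to 0."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


feats = extract_stylometric("म घर जान्छु। म घर आउँछु।")
vector = [feats[name] for name in STYLO_COLS]
scaled = standardize([[1.0, 5.0], [3.0, 5.0]])
```

At inference time the mean and standard deviation must come from the training data, which is why the fitted scaler belongs in the saved bundle.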
332
+
333
+ ### Semantic Embeddings
334
+
335
+ The notebook uses the Sentence Transformers model:
336
+
337
+ - sentence-transformers/paraphrase-multilingual-mpnet-base-v2
338
+
339
+ This produces 768-dimensional multilingual sentence embeddings. The notebook loads the embedder on CPU to reduce CUDA memory pressure.
340
+
341
+ ### Feature Fusion
342
+
343
+ The final feature matrix is built by concatenating:
344
+
345
+ - 768 embedding dimensions
346
+ - 6 scaled stylometric dimensions
347
+
348
+ So each sample becomes a 774-dimensional vector.
349
+
350
+ ### Models Trained
351
+
352
+ Two models are trained on the fused features:
353
+
354
+ - Logistic Regression
355
+ - XGBoost
356
+
357
+ XGBoost is configured with class imbalance handling through scale_pos_weight.
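A common way to set scale_pos_weight is the ratio of negative to positive training samples. The sketch below computes that ratio; the label counts are made up, and treating class 1 as the positive class is an assumption about the notebook's label map.

```python
from collections import Counter


def compute_scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive samples, the usual scale_pos_weight heuristic."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples in labels")
    return n_neg / n_pos


# Hypothetical training labels: 300 negatives, 100 positives.
y_train = [0] * 300 + [1] * 100
weight = compute_scale_pos_weight(y_train)  # 3.0
```

If the dataset has already been balanced during preprocessing, this ratio is close to 1 and the parameter has little effect.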
358
+
359
+ ### Evaluation
360
+
361
+ The notebook evaluates both models using:
362
+
363
+ - accuracy
364
+ - precision
365
+ - recall
366
+ - F1 score
367
+ - confusion matrices
368
+ - ROC curves and AUC
369
+
370
+ It also computes and visualizes XGBoost feature importance.
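The threshold-based metrics in the list above all derive from the four confusion-matrix counts. The following sketch shows the arithmetic in plain Python; it illustrates the definitions and is not the notebook's evaluation code, and the labels are invented.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1 and confusion matrix for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true classes, columns are predicted classes: [[tn, fp], [fn, tp]].
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Hypothetical test labels and predictions.
metrics = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

ROC curves and AUC are different in kind: they need the predicted probabilities rather than hard labels, because they sweep over every decision threshold.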
371
+
372
+ ### Prediction Flow
373
+
374
+ The prediction function follows this exact sequence:
375
+
376
+ 1. clean the input
377
+ 2. extract stylometric features
378
+ 3. build the sentence embedding
379
+ 4. scale the stylometric vector
380
+ 5. concatenate the two feature blocks
381
+ 6. predict with XGBoost
382
+
383
+ The function returns a dictionary containing the label, numeric class id, and probability.
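The six-step sequence can be expressed as a small function that takes each component as a callable. Everything here is a stand-in: the parameter names, the dictionary keys, the 0.5 threshold, and the direction of `label_map` are assumptions, and the lambdas are stubs in place of the real cleaner, embedder, scaler, and XGBoost model.

```python
def predict_text(text, *, clean, extract_features, embed, scale, predict_proba,
                 label_map=None):
    """Run the clean -> features -> embed -> scale -> fuse -> predict sequence."""
    label_map = label_map or {0: "human", 1: "ai"}  # assumed mapping
    cleaned = clean(text)                           # 1. clean the input
    stylo = extract_features(cleaned)               # 2. stylometric features
    embedding = embed(cleaned)                      # 3. sentence embedding
    stylo_scaled = scale(stylo)                     # 4. scale the stylometric vector
    fused = list(embedding) + list(stylo_scaled)    # 5. concatenate both blocks
    prob_positive = predict_proba(fused)            # 6. probability of class 1
    class_id = int(prob_positive >= 0.5)
    return {
        "label": label_map[class_id],
        "class_id": class_id,
        "probability": prob_positive,
    }


# Stub components standing in for the real pipeline pieces.
result = predict_text(
    "  नमुना पाठ  ",
    clean=str.strip,
    extract_features=lambda t: [float(len(t.split()))] * 6,
    embed=lambda t: [0.0] * 768,
    scale=lambda v: [x / 10 for x in v],
    predict_proba=lambda fused: 0.9 if len(fused) == 774 else 0.0,
)
```

The stub model only returns 0.9 when it receives a 774-dimensional vector, which doubles as a check that the embedding and stylometric blocks were fused in full.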
384
+
385
+ ### Saved Artifacts
386
+
387
+ The notebook saves a joblib bundle at:
388
+
389
+ - ../models/ai_text_detector_model.pkl
390
+
391
+ The saved artifact includes:
392
+
393
+ - xgb_model
394
+ - lr_model
395
+ - scaler
396
+ - embed_model name string
397
+ - stylo_cols
398
+ - label_map
399
+
400
+ ### Known Issues
401
+
402
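+ - The XGBoost fit call passes the test set as eval_set. That is acceptable for monitoring, but if early stopping or any tuning decision keys off that set, the reported test metrics become optimistically biased; a separate validation split would avoid this.
401
+ - The embedding model name is saved, but the embedder itself is not serialized, so loading the bundle also requires access to the same Sentence Transformers model.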
404
+ - The notebook is the strongest production candidate, but it still lacks a separate load-and-predict helper for end users.
405
+
406
+ ## Comparison Summary
407
+
408
+ | Notebook | Main Approach | Strength | Weakness |
409
+ |---|---|---|---|
410
+ | main.ipynb | TF-IDF + stylometry + classic ML | Simple baseline, easy to inspect | Repetitive and not fully serializable |
411
+ | working model.ipynb | TF-IDF + stylometry + chunking | Better handling of long text | Still mostly a baseline notebook |
412
+ | Nepali_Ai_vs_Human.ipynb | Transformer classifier | Strong semantic modeling | Heavier, more environment-sensitive |
413
+ | final_main.ipynb | Sentence embeddings + stylometry + XGBoost | Best balance of performance, clarity, and deployability | Uses a saved model name string instead of serializing the embedder |
414
+
415
+ ## Recommended Reading Order
416
+
417
+ If you want to understand the project evolution, read the notebooks in this order:
418
+
419
+ 1. main.ipynb
420
+ 2. working model.ipynb
421
+ 3. Nepali_Ai_vs_Human.ipynb
422
+ 4. final_main.ipynb
423
+
424
+ If you only want the most useful notebook for reuse or deployment, start with final_main.ipynb.
425
+
426
+ ## Practical Notes
427
+
428
+ - Several notebooks contain duplicated or stale cells from experimentation.
429
+ - Not every cell has been executed successfully.
430
+ - Paths are sometimes hard-coded for the local workspace, so moving the folder may require path cleanup.
431
+ - The project alternates between three styles of modeling: classical sklearn, transformer fine-tuning, and hybrid embedding-based classification.
432
+
433
+ ## Suggested Next Step
434
+
435
+ The next useful document to add is an inference guide that explains how to load the saved model bundle from final_main.ipynb and run predictions on new Nepali text.