Ubuntu committed on Oct 16, 2023

Commit

07eb0e9

1 Parent(s): e77b318

added Azure NER

This view is limited to 50 files because the commit contains too many changes.

Files changed (50)

__pycache__/keys.cpython-310.pyc +0 -0
data/wolf_cut_labelled.csv +3 -0
data/wolf_cut_temp.csv +3 -0
data_intent/intent_data.csv +2 -2
data_intent/temp.csv +3 -0
finetuned_entity_categorical_classification/checkpoint-1681/optimizer.pt +1 -1
finetuned_entity_categorical_classification/checkpoint-1681/pytorch_model.bin +1 -1
finetuned_entity_categorical_classification/checkpoint-1681/rng_state.pth +0 -0
finetuned_entity_categorical_classification/checkpoint-1681/trainer_state.json +10 -10
finetuned_entity_categorical_classification/checkpoint-1681/training_args.bin +1 -1
finetuned_entity_categorical_classification/checkpoint-3362/optimizer.pt +1 -1
finetuned_entity_categorical_classification/checkpoint-3362/pytorch_model.bin +1 -1
finetuned_entity_categorical_classification/checkpoint-3362/rng_state.pth +0 -0
finetuned_entity_categorical_classification/checkpoint-3362/trainer_state.json +18 -18
finetuned_entity_categorical_classification/checkpoint-3362/training_args.bin +1 -1
finetuned_entity_categorical_classification/runs/Oct13_10-29-55_ip-172-31-95-165/events.out.tfevents.1697192996.ip-172-31-95-165.139501.0 +0 -0
intent_classification_model/{checkpoint-324 → checkpoint-1216}/added_tokens.json +0 -0
intent_classification_model/{checkpoint-324 → checkpoint-1216}/config.json +0 -0
intent_classification_model/{checkpoint-324 → checkpoint-1216}/optimizer.pt +1 -1
intent_classification_model/{checkpoint-324 → checkpoint-1216}/pytorch_model.bin +1 -1
intent_classification_model/checkpoint-1216/rng_state.pth +0 -0
intent_classification_model/{checkpoint-324 → checkpoint-1216}/scheduler.pt +1 -1
intent_classification_model/{checkpoint-324 → checkpoint-1216}/special_tokens_map.json +0 -0
intent_classification_model/{checkpoint-324 → checkpoint-1216}/tokenizer.json +0 -0
intent_classification_model/{checkpoint-324 → checkpoint-1216}/tokenizer_config.json +0 -0
intent_classification_model/checkpoint-1216/trainer_state.json +175 -0
intent_classification_model/{checkpoint-324 → checkpoint-1216}/training_args.bin +1 -1
intent_classification_model/{checkpoint-324 → checkpoint-1216}/vocab.txt +0 -0
intent_classification_model/checkpoint-1376/added_tokens.json +7 -0
intent_classification_model/checkpoint-1376/config.json +39 -0
intent_classification_model/checkpoint-1376/optimizer.pt +3 -0
intent_classification_model/checkpoint-1376/pytorch_model.bin +3 -0
intent_classification_model/{checkpoint-324 → checkpoint-1376}/rng_state.pth +0 -0
intent_classification_model/checkpoint-1376/scheduler.pt +3 -0
intent_classification_model/checkpoint-1376/special_tokens_map.json +7 -0
intent_classification_model/checkpoint-1376/tokenizer.json +0 -0
intent_classification_model/checkpoint-1376/tokenizer_config.json +56 -0
intent_classification_model/checkpoint-1376/trainer_state.json +175 -0
intent_classification_model/checkpoint-1376/training_args.bin +3 -0
intent_classification_model/checkpoint-1376/vocab.txt +0 -0
intent_classification_model/checkpoint-324/trainer_state.json +0 -73
intent_classification_model/runs/Oct13_10-35-17_ip-172-31-95-165/events.out.tfevents.1697193318.ip-172-31-95-165.139816.0 +0 -0
intent_classification_model/runs/Oct13_10-49-20_ip-172-31-95-165/events.out.tfevents.1697194161.ip-172-31-95-165.140238.0 +0 -0
research/09_fine_tuning_for_datacategories.ipynb +122 -115
research/11_evaluation.ipynb +258 -50
research/11_intent_classification_using_distilbert.ipynb +255 -143
research/12_text_analytics_using_azure.ipynb +407 -0
research/13_data_categories.ipynb +0 -0
utils/__pycache__/get_category.cpython-310.pyc +0 -0
utils/__pycache__/get_intent.cpython-310.pyc +0 -0

__pycache__/keys.cpython-310.pyc CHANGED

Binary files a/__pycache__/keys.cpython-310.pyc and b/__pycache__/keys.cpython-310.pyc differ

data/wolf_cut_labelled.csv ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:809d5432ceb512c742171eaefe4862dcc283674b8eab13eacf17ff15595fc16a
+size 278211
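
The three `+` lines above are a Git LFS pointer rather than the CSV itself: a spec version, a SHA-256 object ID, and the size of the real file in bytes. A minimal Python sketch of reading such a pointer, assuming the pointer text is already in a string; the helper name `parse_lfs_pointer` is hypothetical and not part of any Git LFS tooling:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algorithm": algo,
        "oid": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content copied from the diff above (without the leading "+").
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:809d5432ceb512c742171eaefe4862dcc283674b8eab13eacf17ff15595fc16a
size 278211
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["oid_algorithm"], pointer["size_bytes"])
```

Because only the pointer is versioned, a `CHANGED` entry with a new `oid` and an unchanged `size` (as with the checkpoint binaries in this commit) means the file was rewritten with different contents of the same length.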

data/wolf_cut_temp.csv ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d7a72974667af5a81b8012edba66f761f6c6784d03658413c37db06b0e94f0fb
+size 52781

data_intent/intent_data.csv CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:24091e2e977d444be178138ac717fa57b8d16534dcf5e66d4084cf3f77e6f6ce
-size 39551
+oid sha256:2ee34445e32b84ac258ad523d7c6b1c6babf326a6932ae05f4a9aeae01ae4366
+size 72303

data_intent/temp.csv ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1c89381303aa0fec070d7141d2e3ad2699daf9d0fb0c2a99eec7625c41977b62
+size 632216

finetuned_entity_categorical_classification/checkpoint-1681/optimizer.pt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7482411d85a2d5cf5f632c997d2e07449fe4217bcf4b1aad0b38f9138d1acd0a
+oid sha256:7ddb82ef6b7ce9d69183007173cd0480840f0e859a1284293e8d83debea834d5
 size 535881018

finetuned_entity_categorical_classification/checkpoint-1681/pytorch_model.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f30aacfea59fa26f3b7edc0f510fe6d083c82c0a92e3118f80f0b13f375cb74e
+oid sha256:1026e1cb049c206c60d220d76f2ad9cccabbb8a8e435bf46049bfcbb6b973a7f
 size 267932842

finetuned_entity_categorical_classification/checkpoint-1681/rng_state.pth CHANGED

Binary files a/finetuned_entity_categorical_classification/checkpoint-1681/rng_state.pth and b/finetuned_entity_categorical_classification/checkpoint-1681/rng_state.pth differ

finetuned_entity_categorical_classification/checkpoint-1681/trainer_state.json CHANGED

@@ -1,5 +1,5 @@
 {
-  "best_metric": 0.10296357423067093,
+  "best_metric": 0.07765195518732071,
   "best_model_checkpoint": "finetuned_entity_categorical_classification/checkpoint-1681",
   "epoch": 1.0,
   "eval_steps": 500,
@@ -11,28 +11,28 @@
     {
       "epoch": 0.3,
       "learning_rate": 1.7025580011897683e-05,
-      "loss": 0.1045,
+      "loss": 0.1008,
       "step": 500
     },
     {
       "epoch": 0.59,
       "learning_rate": 1.405116002379536e-05,
-      "loss": 0.1056,
+      "loss": 0.1133,
       "step": 1000
     },
     {
       "epoch": 0.89,
       "learning_rate": 1.1076740035693041e-05,
-      "loss": 0.1041,
+      "loss": 0.1023,
       "step": 1500
     },
     {
       "epoch": 1.0,
-      "eval_accuracy": 0.9721850364420646,
-      "eval_loss": 0.10296357423067093,
-      "eval_runtime": 2.316,
-      "eval_samples_per_second": 2902.854,
-      "eval_steps_per_second": 181.779,
+      "eval_accuracy": 0.9753086419753086,
+      "eval_loss": 0.07765195518732071,
+      "eval_runtime": 2.2887,
+      "eval_samples_per_second": 2937.427,
+      "eval_steps_per_second": 183.944,
       "step": 1681
     }
   ],
@@ -40,7 +40,7 @@
   "max_steps": 3362,
   "num_train_epochs": 2,
   "save_steps": 500,
-  "total_flos": 108413372385396.0,
+  "total_flos": 106434534943386.0,
   "trial_name": null,
   "trial_params": null
 }

finetuned_entity_categorical_classification/checkpoint-1681/training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2de83bc1893d1870cbe886f5287e02f718e1fe0be09dba843ccfc561aeb95ec6
+oid sha256:38ca296b683b24f6f80d4f29a9a0c986a837732910bd0a31303095257578ddfb
 size 4600

finetuned_entity_categorical_classification/checkpoint-3362/optimizer.pt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d015879f29a2744736a3ba7748885a4ec943584a74c779bc00637389c2d90ccd
+oid sha256:167b28137ba8f1cd7b5e16c91eb0e53bf3273a77a9f450b8f88896a8fc0333a5
 size 535881018

finetuned_entity_categorical_classification/checkpoint-3362/pytorch_model.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a2f9ac5b4263d73b4fe5715bd69766cb18cb5925f401945d0c67275a65364524
+oid sha256:c4394c17645f6749fa890492765494e6f6dcf094a971ee68dff1d187d6339a1d
 size 267932842

finetuned_entity_categorical_classification/checkpoint-3362/rng_state.pth CHANGED

Binary files a/finetuned_entity_categorical_classification/checkpoint-3362/rng_state.pth and b/finetuned_entity_categorical_classification/checkpoint-3362/rng_state.pth differ

finetuned_entity_categorical_classification/checkpoint-3362/trainer_state.json CHANGED

@@ -1,5 +1,5 @@
 {
-  "best_metric": 0.10296357423067093,
+  "best_metric": 0.07765195518732071,
   "best_model_checkpoint": "finetuned_entity_categorical_classification/checkpoint-1681",
   "epoch": 2.0,
   "eval_steps": 500,
@@ -11,55 +11,55 @@
     {
       "epoch": 0.3,
       "learning_rate": 1.7025580011897683e-05,
-      "loss": 0.1045,
+      "loss": 0.1008,
       "step": 500
     },
     {
       "epoch": 0.59,
       "learning_rate": 1.405116002379536e-05,
-      "loss": 0.1056,
+      "loss": 0.1133,
       "step": 1000
     },
     {
       "epoch": 0.89,
       "learning_rate": 1.1076740035693041e-05,
-      "loss": 0.1041,
+      "loss": 0.1023,
       "step": 1500
     },
     {
       "epoch": 1.0,
-      "eval_accuracy": 0.9721850364420646,
-      "eval_loss": 0.10296357423067093,
-      "eval_runtime": 2.316,
-      "eval_samples_per_second": 2902.854,
-      "eval_steps_per_second": 181.779,
+      "eval_accuracy": 0.9753086419753086,
+      "eval_loss": 0.07765195518732071,
+      "eval_runtime": 2.2887,
+      "eval_samples_per_second": 2937.427,
+      "eval_steps_per_second": 183.944,
       "step": 1681
     },
     {
       "epoch": 1.19,
       "learning_rate": 8.10232004759072e-06,
-      "loss": 0.0776,
+      "loss": 0.0827,
       "step": 2000
     },
     {
       "epoch": 1.49,
       "learning_rate": 5.1279000594884e-06,
-      "loss": 0.0675,
+      "loss": 0.0702,
       "step": 2500
     },
     {
       "epoch": 1.78,
       "learning_rate": 2.1534800713860798e-06,
-      "loss": 0.0773,
+      "loss": 0.0834,
       "step": 3000
     },
     {
       "epoch": 2.0,
-      "eval_accuracy": 0.9708463483563885,
-      "eval_loss": 0.11056160181760788,
-      "eval_runtime": 2.2742,
-      "eval_samples_per_second": 2956.182,
-      "eval_steps_per_second": 185.119,
+      "eval_accuracy": 0.9747136694927859,
+      "eval_loss": 0.08629146963357925,
+      "eval_runtime": 2.3024,
+      "eval_samples_per_second": 2919.969,
+      "eval_steps_per_second": 182.851,
       "step": 3362
     }
   ],
@@ -67,7 +67,7 @@
   "max_steps": 3362,
   "num_train_epochs": 2,
   "save_steps": 500,
-  "total_flos": 216609059710134.0,
+  "total_flos": 213673546900476.0,
   "trial_name": null,
   "trial_params": null
 }

finetuned_entity_categorical_classification/checkpoint-3362/training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2de83bc1893d1870cbe886f5287e02f718e1fe0be09dba843ccfc561aeb95ec6
+oid sha256:38ca296b683b24f6f80d4f29a9a0c986a837732910bd0a31303095257578ddfb
 size 4600

finetuned_entity_categorical_classification/runs/Oct13_10-29-55_ip-172-31-95-165/events.out.tfevents.1697192996.ip-172-31-95-165.139501.0 ADDED

Binary file (7.68 kB).

intent_classification_model/{checkpoint-324 → checkpoint-1216}/added_tokens.json RENAMED

File without changes

intent_classification_model/{checkpoint-324 → checkpoint-1216}/config.json RENAMED

File without changes

intent_classification_model/{checkpoint-324 → checkpoint-1216}/optimizer.pt RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a50f88f7a9097ecddb2b3c7e3d38747deec4ca3a386132fac9e0e4efaa82ae0e
+oid sha256:97791790fb47e0d2262cfd6c379f3e36d956e7ef05ddcfcd905abba63c990209
 size 535745722

intent_classification_model/{checkpoint-324 → checkpoint-1216}/pytorch_model.bin RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b339df5c0d892e025a1749d085ab010e551f4b249eb497812a1a3bd7ebd5fd99
+oid sha256:3d83acd64be6fc794a8e6c94f48eb095fd23679e7c612bd83712b5738588b1b8
 size 267865194

intent_classification_model/checkpoint-1216/rng_state.pth ADDED

Binary file (14.2 kB).

intent_classification_model/{checkpoint-324 → checkpoint-1216}/scheduler.pt RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:73f74582c189fe624f606122980ccb279125588a1db45b4052dc704fa2b51184
+oid sha256:a94db5976ef19e649b033b8c416b03f555990a66e540f81cc5eccc167168f1bc
 size 1064

intent_classification_model/{checkpoint-324 → checkpoint-1216}/special_tokens_map.json RENAMED

File without changes

intent_classification_model/{checkpoint-324 → checkpoint-1216}/tokenizer.json RENAMED

File without changes

intent_classification_model/{checkpoint-324 → checkpoint-1216}/tokenizer_config.json RENAMED

File without changes

intent_classification_model/checkpoint-1216/trainer_state.json ADDED

@@ -0,0 +1,175 @@
+{
+  "best_metric": 0.06275933235883713,
+  "best_model_checkpoint": "intent_classification_model/checkpoint-152",
+  "epoch": 16.0,
+  "eval_steps": 500,
+  "global_step": 1216,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 1.0,
+      "eval_accuracy": 0.9867549668874173,
+      "eval_loss": 0.20886486768722534,
+      "eval_runtime": 0.1475,
+      "eval_samples_per_second": 2048.099,
+      "eval_steps_per_second": 128.854,
+      "step": 76
+    },
+    {
+      "epoch": 2.0,
+      "eval_accuracy": 0.9834437086092715,
+      "eval_loss": 0.06275933235883713,
+      "eval_runtime": 0.1586,
+      "eval_samples_per_second": 1904.103,
+      "eval_steps_per_second": 119.795,
+      "step": 152
+    },
+    {
+      "epoch": 3.0,
+      "eval_accuracy": 0.9867549668874173,
+      "eval_loss": 0.06509935110807419,
+      "eval_runtime": 0.1445,
+      "eval_samples_per_second": 2090.586,
+      "eval_steps_per_second": 131.527,
+      "step": 228
+    },
+    {
+      "epoch": 4.0,
+      "eval_accuracy": 0.9768211920529801,
+      "eval_loss": 0.08112386614084244,
+      "eval_runtime": 0.1335,
+      "eval_samples_per_second": 2262.833,
+      "eval_steps_per_second": 142.364,
+      "step": 304
+    },
+    {
+      "epoch": 5.0,
+      "eval_accuracy": 0.9701986754966887,
+      "eval_loss": 0.11257749050855637,
+      "eval_runtime": 0.134,
+      "eval_samples_per_second": 2253.71,
+      "eval_steps_per_second": 141.79,
+      "step": 380
+    },
+    {
+      "epoch": 6.0,
+      "eval_accuracy": 0.9735099337748344,
+      "eval_loss": 0.11174333095550537,
+      "eval_runtime": 0.1339,
+      "eval_samples_per_second": 2255.512,
+      "eval_steps_per_second": 141.903,
+      "step": 456
+    },
+    {
+      "epoch": 6.58,
+      "learning_rate": 1.1776315789473684e-05,
+      "loss": 0.1883,
+      "step": 500
+    },
+    {
+      "epoch": 7.0,
+      "eval_accuracy": 0.9768211920529801,
+      "eval_loss": 0.10020075738430023,
+      "eval_runtime": 0.145,
+      "eval_samples_per_second": 2083.04,
+      "eval_steps_per_second": 131.052,
+      "step": 532
+    },
+    {
+      "epoch": 8.0,
+      "eval_accuracy": 0.9735099337748344,
+      "eval_loss": 0.116866335272789,
+      "eval_runtime": 0.1348,
+      "eval_samples_per_second": 2240.912,
+      "eval_steps_per_second": 140.985,
+      "step": 608
+    },
+    {
+      "epoch": 9.0,
+      "eval_accuracy": 0.9701986754966887,
+      "eval_loss": 0.14152054488658905,
+      "eval_runtime": 0.1308,
+      "eval_samples_per_second": 2309.736,
+      "eval_steps_per_second": 145.314,
+      "step": 684
+    },
+    {
+      "epoch": 10.0,
+      "eval_accuracy": 0.9735099337748344,
+      "eval_loss": 0.1344088315963745,
+      "eval_runtime": 0.1195,
+      "eval_samples_per_second": 2526.256,
+      "eval_steps_per_second": 158.937,
+      "step": 760
+    },
+    {
+      "epoch": 11.0,
+      "eval_accuracy": 0.9735099337748344,
+      "eval_loss": 0.13409321010112762,
+      "eval_runtime": 0.1399,
+      "eval_samples_per_second": 2159.267,
+      "eval_steps_per_second": 135.848,
+      "step": 836
+    },
+    {
+      "epoch": 12.0,
+      "eval_accuracy": 0.9735099337748344,
+      "eval_loss": 0.12705937027931213,
+      "eval_runtime": 0.1366,
+      "eval_samples_per_second": 2210.321,
+      "eval_steps_per_second": 139.06,
+      "step": 912
+    },
+    {
+      "epoch": 13.0,
+      "eval_accuracy": 0.9735099337748344,
+      "eval_loss": 0.13874845206737518,
+      "eval_runtime": 0.1374,
+      "eval_samples_per_second": 2197.254,
+      "eval_steps_per_second": 138.238,
+      "step": 988
+    },
+    {
+      "epoch": 13.16,
+      "learning_rate": 3.5526315789473687e-06,
+      "loss": 0.018,
+      "step": 1000
+    },
+    {
+      "epoch": 14.0,
+      "eval_accuracy": 0.9735099337748344,
+      "eval_loss": 0.13716736435890198,
+      "eval_runtime": 0.1193,
+      "eval_samples_per_second": 2530.546,
+      "eval_steps_per_second": 159.207,
+      "step": 1064
+    },
+    {
+      "epoch": 15.0,
+      "eval_accuracy": 0.9735099337748344,
+      "eval_loss": 0.13588877022266388,
+      "eval_runtime": 0.1396,
+      "eval_samples_per_second": 2163.789,
+      "eval_steps_per_second": 136.132,
+      "step": 1140
+    },
+    {
+      "epoch": 16.0,
+      "eval_accuracy": 0.9735099337748344,
+      "eval_loss": 0.13579562306404114,
+      "eval_runtime": 0.1288,
+      "eval_samples_per_second": 2345.226,
+      "eval_steps_per_second": 147.547,
+      "step": 1216
+    }
+  ],
+  "logging_steps": 500,
+  "max_steps": 1216,
+  "num_train_epochs": 16,
+  "save_steps": 500,
+  "total_flos": 62384098266840.0,
+  "trial_name": null,
+  "trial_params": null
+}
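
In this file `best_metric` equals the lowest `eval_loss` in `log_history` (about 0.0628 at step 152), which is consistent with `best_model_checkpoint` ending in `checkpoint-152`. A minimal sketch of that lookup, using a hand-copied excerpt of the entries above. The helper `best_eval_entry` is hypothetical, and treating lower `eval_loss` as better is an assumption inferred from these values, not something read from `training_args.bin`:

```python
import json

# Excerpt of the log_history above: the first four evaluation entries plus one
# training-loss entry. The remaining epochs are omitted for brevity.
trainer_state = json.loads("""
{
  "best_metric": 0.06275933235883713,
  "best_model_checkpoint": "intent_classification_model/checkpoint-152",
  "log_history": [
    {"epoch": 1.0, "eval_loss": 0.20886486768722534, "step": 76},
    {"epoch": 2.0, "eval_loss": 0.06275933235883713, "step": 152},
    {"epoch": 3.0, "eval_loss": 0.06509935110807419, "step": 228},
    {"epoch": 4.0, "eval_loss": 0.08112386614084244, "step": 304},
    {"epoch": 6.58, "loss": 0.1883, "step": 500}
  ]
}
""")


def best_eval_entry(state: dict) -> dict:
    """Return the evaluation entry with the lowest eval_loss (hypothetical helper)."""
    # Training-loss entries have "loss" but no "eval_loss", so they are skipped.
    evals = [entry for entry in state["log_history"] if "eval_loss" in entry]
    return min(evals, key=lambda entry: entry["eval_loss"])


best = best_eval_entry(trainer_state)
print(best["step"], best["eval_loss"])
```

The highest `eval_accuracy` in the full log (about 0.9868, at epochs 1 and 3) does not fall on the same epoch as the lowest `eval_loss`, so selecting on accuracy instead would point at a different checkpoint.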

intent_classification_model/{checkpoint-324 → checkpoint-1216}/training_args.bin RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c27308f0087e544f12e1806abafb33d65745a5791fb1559d9e521f3670215df9
+oid sha256:40b975e2b309584fec6c9097bbbfc4736c3bbe492681259866398911daf0ae0c
 size 4536

intent_classification_model/{checkpoint-324 → checkpoint-1216}/vocab.txt RENAMED

File without changes

intent_classification_model/checkpoint-1376/added_tokens.json ADDED

@@ -0,0 +1,7 @@
+{
+  "[CLS]": 101,
+  "[MASK]": 103,
+  "[PAD]": 0,
+  "[SEP]": 102,
+  "[UNK]": 100
+}

intent_classification_model/checkpoint-1376/config.json ADDED

@@ -0,0 +1,39 @@
+{
+  "_name_or_path": "distilbert-base-uncased",
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForSequenceClassification"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "id2label": {
+    "0": "Commercial",
+    "1": "Informational",
+    "2": "Navigational",
+    "3": "Local",
+    "4": "Transactional"
+  },
+  "initializer_range": 0.02,
+  "label2id": {
+    "Commercial": 0,
+    "Informational": 1,
+    "Local": 3,
+    "Navigational": 2,
+    "Transactional": 4
+  },
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "pad_token_id": 0,
+  "problem_type": "single_label_classification",
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.34.0",
+  "vocab_size": 30522
+}
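
The `id2label` and `label2id` maps above are inverses of each other. `label2id` is listed alphabetically, which is why `Local` (3) appears before `Navigational` (2). A small sketch of checking that round trip and decoding a predicted class index, with the five labels copied from the config; the logits vector is made up for illustration and is not model output:

```python
# Label maps copied from config.json above; JSON object keys are strings.
id2label = {
    "0": "Commercial",
    "1": "Informational",
    "2": "Navigational",
    "3": "Local",
    "4": "Transactional",
}
label2id = {
    "Commercial": 0,
    "Informational": 1,
    "Local": 3,
    "Navigational": 2,
    "Transactional": 4,
}

# The two maps should be exact inverses of each other.
is_consistent = all(label2id[label] == int(idx) for idx, label in id2label.items())

# Hypothetical logits for one query (illustrative values only).
logits = [0.1, 2.3, -0.4, 0.0, 1.1]
predicted_id = max(range(len(logits)), key=lambda i: logits[i])
predicted_label = id2label[str(predicted_id)]
print(is_consistent, predicted_label)
```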

intent_classification_model/checkpoint-1376/optimizer.pt ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7f2ed586c32f48dd2cece37baf89590cc951fda221ec175eadd3034e996abe25
+size 535745722

intent_classification_model/checkpoint-1376/pytorch_model.bin ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:153cb325de818e493f8a0a7aa1fbcc5cf3d8fa27d07339fbfd1d8e238d8cb38b
+size 267865194

intent_classification_model/{checkpoint-324 → checkpoint-1376}/rng_state.pth RENAMED

Binary files a/intent_classification_model/checkpoint-324/rng_state.pth and b/intent_classification_model/checkpoint-1376/rng_state.pth differ

intent_classification_model/checkpoint-1376/scheduler.pt ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5c80c9f7b843dea09bd3b8739eafa7b84f67f346b13150be7548d804af238e2c
+size 1064

intent_classification_model/checkpoint-1376/special_tokens_map.json ADDED

@@ -0,0 +1,7 @@
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

intent_classification_model/checkpoint-1376/tokenizer.json ADDED

The diff for this file is too large to render.

intent_classification_model/checkpoint-1376/tokenizer_config.json ADDED

@@ -0,0 +1,56 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [],
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "DistilBertTokenizer",
+  "unk_token": "[UNK]"
+}

intent_classification_model/checkpoint-1376/trainer_state.json ADDED

@@ -0,0 +1,175 @@
+{
+  "best_metric": 0.10133440792560577,
+  "best_model_checkpoint": "intent_classification_model/checkpoint-344",
+  "epoch": 16.0,
+  "eval_steps": 500,
+  "global_step": 1376,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 1.0,
+      "eval_accuracy": 0.956140350877193,
+      "eval_loss": 0.24781915545463562,
+      "eval_runtime": 0.1669,
+      "eval_samples_per_second": 2049.176,
+      "eval_steps_per_second": 131.818,
+      "step": 86
+    },
+    {
+      "epoch": 2.0,
+      "eval_accuracy": 0.9766081871345029,
+      "eval_loss": 0.10303749144077301,
+      "eval_runtime": 0.2792,
+      "eval_samples_per_second": 1224.804,
+      "eval_steps_per_second": 78.789,
+      "step": 172
+    },
+    {
+      "epoch": 3.0,
+      "eval_accuracy": 0.9736842105263158,
+      "eval_loss": 0.12486349791288376,
+      "eval_runtime": 0.1527,
+      "eval_samples_per_second": 2239.207,
+      "eval_steps_per_second": 144.043,
+      "step": 258
+    },
+    {
+      "epoch": 4.0,
+      "eval_accuracy": 0.9766081871345029,
+      "eval_loss": 0.10133440792560577,
+      "eval_runtime": 0.1513,
+      "eval_samples_per_second": 2260.581,
+      "eval_steps_per_second": 145.418,
+      "step": 344
+    },
+    {
+      "epoch": 5.0,
+      "eval_accuracy": 0.9766081871345029,
+      "eval_loss": 0.11906354874372482,
+      "eval_runtime": 0.1397,
+      "eval_samples_per_second": 2448.535,
+      "eval_steps_per_second": 157.508,
+      "step": 430
+    },
+    {
+      "epoch": 5.81,
+      "learning_rate": 1.2732558139534886e-05,
+      "loss": 0.1903,
+      "step": 500
+    },
+    {
+      "epoch": 6.0,
+      "eval_accuracy": 0.9678362573099415,
+      "eval_loss": 0.14922283589839935,
+      "eval_runtime": 0.1511,
+      "eval_samples_per_second": 2264.082,
+      "eval_steps_per_second": 145.643,
+      "step": 516
+    },
+    {
+      "epoch": 7.0,
+      "eval_accuracy": 0.9736842105263158,
+      "eval_loss": 0.10685376077890396,
+      "eval_runtime": 0.1562,
+      "eval_samples_per_second": 2189.014,
+      "eval_steps_per_second": 140.814,
+      "step": 602
+    },
+    {
+      "epoch": 8.0,
+      "eval_accuracy": 0.9736842105263158,
+      "eval_loss": 0.12596090137958527,
+      "eval_runtime": 0.1543,
+      "eval_samples_per_second": 2216.873,
+      "eval_steps_per_second": 142.606,
+      "step": 688
+    },
+    {
+      "epoch": 9.0,
+      "eval_accuracy": 0.9707602339181286,
+      "eval_loss": 0.129041388630867,
+      "eval_runtime": 0.1334,
+      "eval_samples_per_second": 2563.696,
+      "eval_steps_per_second": 164.916,
+      "step": 774
+    },
+    {
+      "epoch": 10.0,
+      "eval_accuracy": 0.9736842105263158,
+      "eval_loss": 0.12375017255544662,
+      "eval_runtime": 0.1513,
+      "eval_samples_per_second": 2261.041,
+      "eval_steps_per_second": 145.447,
+      "step": 860
+    },
+    {
+      "epoch": 11.0,
+      "eval_accuracy": 0.9736842105263158,
+      "eval_loss": 0.12813875079154968,
+      "eval_runtime": 0.1546,
+      "eval_samples_per_second": 2212.042,
+      "eval_steps_per_second": 142.295,
+      "step": 946
+    },
+    {
+      "epoch": 11.63,
+      "learning_rate": 5.465116279069767e-06,
+      "loss": 0.0258,
+      "step": 1000
+    },
+    {
+      "epoch": 12.0,
+      "eval_accuracy": 0.9736842105263158,
+      "eval_loss": 0.13388033211231232,
+      "eval_runtime": 0.1607,
+      "eval_samples_per_second": 2128.444,
+      "eval_steps_per_second": 136.917,
+      "step": 1032
+    },
+    {
+      "epoch": 13.0,
+      "eval_accuracy": 0.9736842105263158,
+      "eval_loss": 0.1308409869670868,
+      "eval_runtime": 0.1401,
+      "eval_samples_per_second": 2441.546,
+      "eval_steps_per_second": 157.058,
+      "step": 1118
+    },
+    {
+      "epoch": 14.0,
+      "eval_accuracy": 0.9736842105263158,
+      "eval_loss": 0.13211463391780853,
+      "eval_runtime": 0.1539,
+      "eval_samples_per_second": 2222.296,
+      "eval_steps_per_second": 142.955,
+      "step": 1204
+    },
+    {
+      "epoch": 15.0,
+      "eval_accuracy": 0.9736842105263158,
+      "eval_loss": 0.13366281986236572,
+      "eval_runtime": 0.1507,
+      "eval_samples_per_second": 2269.433,
+      "eval_steps_per_second": 145.987,
+      "step": 1290
+    },
+    {
+      "epoch": 16.0,
+      "eval_accuracy": 0.9736842105263158,
+      "eval_loss": 0.13524049520492554,
+      "eval_runtime": 0.1603,
+      "eval_samples_per_second": 2133.42,
+      "eval_steps_per_second": 137.238,
+      "step": 1376
+    }
+  ],
+  "logging_steps": 500,
+  "max_steps": 1376,
+  "num_train_epochs": 16,
+  "save_steps": 500,
+  "total_flos": 70181981180580.0,
+  "trial_name": null,
+  "trial_params": null
+}

intent_classification_model/checkpoint-1376/training_args.bin ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d0b92fcfbb60dcd18505e69a8641e67a12b1dbb1bb4cf8cf1817bb473e3ed0dc
+size 4536

intent_classification_model/checkpoint-1376/vocab.txt ADDED

The diff for this file is too large to render.

intent_classification_model/checkpoint-324/trainer_state.json DELETED

@@ -1,73 +0,0 @@
-{
-  "best_metric": 0.16397738456726074,
-  "best_model_checkpoint": "intent_classification_model/checkpoint-270",
-  "epoch": 6.0,
-  "eval_steps": 500,
-  "global_step": 324,
-  "is_hyper_param_search": false,
-  "is_local_process_zero": true,
-  "is_world_process_zero": true,
-  "log_history": [
-    {
-      "epoch": 1.0,
-      "eval_accuracy": 0.9488372093023256,
-      "eval_loss": 0.4676927328109741,
-      "eval_runtime": 0.1185,
-      "eval_samples_per_second": 1814.083,
-      "eval_steps_per_second": 118.126,
-      "step": 54
-    },
-    {
-      "epoch": 2.0,
-      "eval_accuracy": 0.9534883720930233,
-      "eval_loss": 0.20428764820098877,
-      "eval_runtime": 0.0972,
-      "eval_samples_per_second": 2210.83,
-      "eval_steps_per_second": 143.961,
-      "step": 108
-    },
-    {
-      "epoch": 3.0,
-      "eval_accuracy": 0.9674418604651163,
-      "eval_loss": 0.16401757299900055,
-      "eval_runtime": 0.1015,
-      "eval_samples_per_second": 2118.828,
-      "eval_steps_per_second": 137.97,
-      "step": 162
-    },
-    {
-      "epoch": 4.0,
-      "eval_accuracy": 0.9674418604651163,
-      "eval_loss": 0.16496841609477997,
-      "eval_runtime": 0.0941,
-      "eval_samples_per_second": 2284.398,
-      "eval_steps_per_second": 148.752,
-      "step": 216
-    },
-    {
-      "epoch": 5.0,
-      "eval_accuracy": 0.9674418604651163,
-      "eval_loss": 0.16397738456726074,
-      "eval_runtime": 0.0975,
-      "eval_samples_per_second": 2204.851,
-      "eval_steps_per_second": 143.572,
-      "step": 270
-    },
-    {
-      "epoch": 6.0,
-      "eval_accuracy": 0.9674418604651163,
-      "eval_loss": 0.16553252935409546,
-      "eval_runtime": 0.0947,
-      "eval_samples_per_second": 2271.063,
-      "eval_steps_per_second": 147.883,
-      "step": 324
-    }
-  ],
-  "logging_steps": 500,
-  "max_steps": 324,
-  "num_train_epochs": 6,
-  "save_steps": 500,
-  "total_flos": 13032177536640.0,
-  "trial_name": null,
-  "trial_params": null
-}

intent_classification_model/runs/Oct13_10-35-17_ip-172-31-95-165/events.out.tfevents.1697193318.ip-172-31-95-165.139816.0 ADDED Viewed

Binary file (10.2 kB).

intent_classification_model/runs/Oct13_10-49-20_ip-172-31-95-165/events.out.tfevents.1697194161.ip-172-31-95-165.140238.0 ADDED

Binary file (10.2 kB).

research/09_fine_tuning_for_datacategories.ipynb CHANGED

@@ -62,93 +62,93 @@
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
-       "      <th>22910</th>\n",
-       "      <td>Retirement income streams explanation</td>\n",
-       "      <td>Finance</td>\n",
-       "      <td>18</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>3202</th>\n",
-       "      <td>Social justice strategies</td>\n",
-       "      <td>People_and_Society</td>\n",
-       "      <td>10</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>23191</th>\n",
-       "      <td>Nanomaterials engineering</td>\n",
        "      <td>Science</td>\n",
        "      <td>2</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>25025</th>\n",
-       "      <td>Acrylic nails</td>\n",
-       "      <td>Beauty_and_Fitness</td>\n",
-       "      <td>9</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>14018</th>\n",
-       "      <td>Substance abuse recovery strategies</td>\n",
-       "      <td>People_and_Society</td>\n",
-       "      <td>10</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>30887</th>\n",
-       "      <td>Facebook privacy</td>\n",
-       "      <td>Online Communities</td>\n",
-       "      <td>8</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>5716</th>\n",
-       "      <td>disability</td>\n",
-       "      <td>Sensitive Subjects</td>\n",
-       "      <td>23</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>25854</th>\n",
-       "      <td>Zumba dance fitness</td>\n",
-       "      <td>Beauty_and_Fitness</td>\n",
-       "      <td>9</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>25032</th>\n",
-       "      <td>Enjoy dick porn</td>\n",
-       "      <td>Adult</td>\n",
-       "      <td>6</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>2008</th>\n",
-       "      <td>iPhone Face ID</td>\n",
-       "      <td>Computers_and_Electronics</td>\n",
-       "      <td>7</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
-       "                                    category                      label  \\\n",
-       "22910  Retirement income streams explanation                    Finance   \n",
-       "3202               Social justice strategies         People_and_Society   \n",
-       "23191              Nanomaterials engineering                    Science   \n",
-       "25025                          Acrylic nails         Beauty_and_Fitness   \n",
-       "14018    Substance abuse recovery strategies         People_and_Society   \n",
-       "30887                       Facebook privacy         Online Communities   \n",
-       "5716                              disability         Sensitive Subjects   \n",
-       "25854                    Zumba dance fitness         Beauty_and_Fitness   \n",
-       "25032                        Enjoy dick porn                      Adult   \n",
-       "2008                          iPhone Face ID  Computers_and_Electronics   \n",
        "\n",
-       "       label_id  \n",
-       "22910        18  \n",
-       "3202         10  \n",
-       "23191         2  \n",
-       "25025         9  \n",
-       "14018        10  \n",
-       "30887         8  \n",
-       "5716         23  \n",
-       "25854         9  \n",
-       "25032         6  \n",
-       "2008          7  "
       ]
      },
      "execution_count": 3,
@@ -273,7 +273,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_129502/984288843.py:1: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -307,71 +307,71 @@
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
-       "      <th>7152</th>\n",
-       "      <td>Social justice strategies</td>\n",
-       "      <td>10</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>31780</th>\n",
-       "      <td>LinkedIn job search for food writing organizat...</td>\n",
-       "      <td>21</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>20244</th>\n",
-       "      <td>Nobel Prize in Literature news</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>16634</th>\n",
-       "      <td>Job search for people with public health impai...</td>\n",
-       "      <td>21</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>8603</th>\n",
-       "      <td>Car insurance for luxury cars</td>\n",
-       "      <td>3</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>30042</th>\n",
-       "      <td>Personal development and self-help techniques ...</td>\n",
-       "      <td>8</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>9345</th>\n",
-       "      <td>Smartwatch features</td>\n",
-       "      <td>7</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>19660</th>\n",
-       "      <td>Travel deals for beachfront chalets</td>\n",
-       "      <td>14</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>27349</th>\n",
-       "      <td>Choosing energy-efficient HVAC</td>\n",
-       "      <td>20</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>12660</th>\n",
-       "      <td>Advocacy for native land rights</td>\n",
-       "      <td>10</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
-       "                                                    text  label\n",
-       "7152                           Social justice strategies     10\n",
-       "31780  LinkedIn job search for food writing organizat...     21\n",
-       "20244                     Nobel Prize in Literature news      1\n",
-       "16634  Job search for people with public health impai...     21\n",
-       "8603                       Car insurance for luxury cars      3\n",
-       "30042  Personal development and self-help techniques ...      8\n",
-       "9345                                 Smartwatch features      7\n",
-       "19660                Travel deals for beachfront chalets     14\n",
-       "27349                     Choosing energy-efficient HVAC     20\n",
-       "12660                    Advocacy for native land rights     10"
       ]
      },
      "execution_count": 6,
@@ -483,8 +483,15 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "Map: 100%|██████████| 26889/26889 [00:00<00:00, 33262.24 examples/s]\n",
-      "Map: 100%|██████████| 6723/6723 [00:00<00:00, 42992.17 examples/s]\n"
      ]
     }
    ],
@@ -501,9 +508,9 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "2023-10-12 11:59:02.472987: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
       "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
-      "2023-10-12 11:59:03.211664: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
      ]
     }
    ],
@@ -686,7 +693,7 @@
        "    <div>\n",
        "      \n",
        "      <progress value='3362' max='3362' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
-       "      [3362/3362 01:46, Epoch 2/2]\n",
        "    </div>\n",
        "    <table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
@@ -700,15 +707,15 @@
        "  <tbody>\n",
        "    <tr>\n",
        "      <td>1</td>\n",
-       "      <td>0.104100</td>\n",
-       "      <td>0.102964</td>\n",
-       "      <td>0.972185</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>2</td>\n",
-       "      <td>0.077300</td>\n",
-       "      <td>0.110562</td>\n",
-       "      <td>0.970846</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table><p>"
@@ -723,7 +730,7 @@
     {
      "data": {
       "text/plain": [
-       "TrainOutput(global_step=3362, training_loss=0.08810693149691462, metrics={'train_runtime': 106.8757, 'train_samples_per_second': 503.183, 'train_steps_per_second': 31.457, 'total_flos': 216609059710134.0, 'train_loss': 0.08810693149691462, 'epoch': 2.0})"
       ]
      },
      "execution_count": 19,

        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
+       "      <th>3982</th>\n",
+       "      <td>Citation context relevance assessment platforms</td>\n",
+       "      <td>Reference</td>\n",
+       "      <td>12</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>24651</th>\n",
+       "      <td>Geology fieldwork</td>\n",
        "      <td>Science</td>\n",
        "      <td>2</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>28113</th>\n",
+       "      <td>Password management for individuals</td>\n",
+       "      <td>Computers_and_Electronics</td>\n",
+       "      <td>7</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>10999</th>\n",
+       "      <td>Real estate market statistics</td>\n",
+       "      <td>Real Estate</td>\n",
+       "      <td>24</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>17096</th>\n",
+       "      <td>Running gear for women</td>\n",
+       "      <td>Beauty_and_Fitness</td>\n",
+       "      <td>9</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>2374</th>\n",
+       "      <td>Sports Team Fan Pride</td>\n",
+       "      <td>Sports</td>\n",
+       "      <td>26</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>9932</th>\n",
+       "      <td>Wine and food events</td>\n",
+       "      <td>Food_and_Drink</td>\n",
+       "      <td>15</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>2953</th>\n",
+       "      <td>College admissions for aspiring dancers</td>\n",
+       "      <td>Jobs_and_Education</td>\n",
+       "      <td>21</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>25038</th>\n",
+       "      <td>Software development best practices forums</td>\n",
+       "      <td>Online Communities</td>\n",
+       "      <td>8</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>29703</th>\n",
+       "      <td>Quantum physics theories</td>\n",
+       "      <td>Science</td>\n",
+       "      <td>2</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
+       "                                              category  \\\n",
+       "3982   Citation context relevance assessment platforms   \n",
+       "24651                                Geology fieldwork   \n",
+       "28113              Password management for individuals   \n",
+       "10999                    Real estate market statistics   \n",
+       "17096                           Running gear for women   \n",
+       "2374                             Sports Team Fan Pride   \n",
+       "9932                              Wine and food events   \n",
+       "2953           College admissions for aspiring dancers   \n",
+       "25038       Software development best practices forums   \n",
+       "29703                         Quantum physics theories   \n",
        "\n",
+       "                           label  label_id  \n",
+       "3982                   Reference        12  \n",
+       "24651                    Science         2  \n",
+       "28113  Computers_and_Electronics         7  \n",
+       "10999                Real Estate        24  \n",
+       "17096         Beauty_and_Fitness         9  \n",
+       "2374                      Sports        26  \n",
+       "9932              Food_and_Drink        15  \n",
+       "2953          Jobs_and_Education        21  \n",
+       "25038         Online Communities         8  \n",
+       "29703                    Science         2  "
       ]
      },
      "execution_count": 3,
      "name": "stderr",
      "output_type": "stream",
      "text": [
+      "/tmp/ipykernel_139501/984288843.py:1: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
+       "      <th>2925</th>\n",
+       "      <td>Kids' toy stores online</td>\n",
+       "      <td>13</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>31108</th>\n",
+       "      <td>Birdwatching apps for bird behavior</td>\n",
+       "      <td>5</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>6817</th>\n",
+       "      <td>Legal developments</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>20037</th>\n",
+       "      <td>Citation context relevance assessment tools</td>\n",
+       "      <td>12</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>18928</th>\n",
+       "      <td>Orchid care guide</td>\n",
+       "      <td>20</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>33358</th>\n",
+       "      <td>Scientific publications and journals</td>\n",
+       "      <td>2</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>16499</th>\n",
+       "      <td>Service dog etiquette</td>\n",
+       "      <td>5</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>26484</th>\n",
+       "      <td>Social media trends analysis</td>\n",
+       "      <td>25</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>15543</th>\n",
+       "      <td>Troubleshooting computer issues</td>\n",
+       "      <td>7</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>15854</th>\n",
+       "      <td>large</td>\n",
+       "      <td>23</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
+       "                                              text  label\n",
+       "2925                       Kids' toy stores online     13\n",
+       "31108          Birdwatching apps for bird behavior      5\n",
+       "6817                            Legal developments      1\n",
+       "20037  Citation context relevance assessment tools     12\n",
+       "18928                            Orchid care guide     20\n",
+       "33358         Scientific publications and journals      2\n",
+       "16499                        Service dog etiquette      5\n",
+       "26484                 Social media trends analysis     25\n",
+       "15543              Troubleshooting computer issues      7\n",
+       "15854                                        large     23"
       ]
      },
      "execution_count": 6,
      "name": "stderr",
      "output_type": "stream",
      "text": [
+      "Map:  48%|████▊     | 13000/26889 [00:00<00:00, 32226.42 examples/s]"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Map: 100%|██████████| 26889/26889 [00:00<00:00, 34388.34 examples/s]\n",
+      "Map: 100%|██████████| 6723/6723 [00:00<00:00, 41978.69 examples/s]\n"
      ]
     }
    ],
      "name": "stderr",
      "output_type": "stream",
      "text": [
+      "2023-10-13 10:29:49.212220: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
       "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
+      "2023-10-13 10:29:50.573292: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
      ]
     }
    ],
        "    <div>\n",
        "      \n",
        "      <progress value='3362' max='3362' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+       "      [3362/3362 01:52, Epoch 2/2]\n",
        "    </div>\n",
        "    <table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <td>1</td>\n",
+       "      <td>0.102300</td>\n",
+       "      <td>0.077652</td>\n",
+       "      <td>0.975309</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>2</td>\n",
+       "      <td>0.083400</td>\n",
+       "      <td>0.086291</td>\n",
+       "      <td>0.974714</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table><p>"
     {
      "data": {
       "text/plain": [
+       "TrainOutput(global_step=3362, training_loss=0.08880683540376008, metrics={'train_runtime': 113.5357, 'train_samples_per_second': 473.666, 'train_steps_per_second': 29.612, 'total_flos': 213673546900476.0, 'train_loss': 0.08880683540376008, 'epoch': 2.0})"
       ]
      },
      "execution_count": 19,

research/11_evaluation.ipynb CHANGED

@@ -13,7 +13,17 @@
    "cell_type": "code",
    "execution_count": 2,
    "metadata": {},
-   "outputs": [],
    "source": [
     "from utils.get_intent import get_top_intent"
    ]
@@ -26,11 +36,11 @@
     {
      "data": {
       "text/plain": [
-       "[('Commercial', 0.969),\n",
-       " ('Transactional', 0.673),\n",
-       " ('Informational', 0.237),\n",
-       " ('Navigational', 0.215),\n",
-       " ('Local', 0.155)]"
       ]
      },
      "execution_count": 3,
@@ -50,11 +60,11 @@
     {
      "data": {
       "text/plain": [
-       "[('Transactional', 0.987),\n",
-       " ('Navigational', 0.317),\n",
-       " ('Commercial', 0.27),\n",
-       " ('Informational', 0.249),\n",
-       " ('Local', 0.229)]"
       ]
      },
      "execution_count": 4,
@@ -74,11 +84,11 @@
     {
      "data": {
       "text/plain": [
-       "[('Informational', 0.984),\n",
-       " ('Local', 0.244),\n",
-       " ('Commercial', 0.237),\n",
-       " ('Transactional', 0.212),\n",
-       " ('Navigational', 0.194)]"
       ]
      },
      "execution_count": 5,
@@ -98,11 +108,11 @@
     {
      "data": {
       "text/plain": [
-       "[('Local', 0.988),\n",
-       " ('Informational', 0.3),\n",
-       " ('Commercial', 0.278),\n",
-       " ('Navigational', 0.273),\n",
-       " ('Transactional', 0.234)]"
       ]
      },
      "execution_count": 6,
@@ -122,11 +132,11 @@
     {
      "data": {
       "text/plain": [
-       "[('Informational', 0.763),\n",
-       " ('Navigational', 0.638),\n",
-       " ('Transactional', 0.433),\n",
-       " ('Commercial', 0.286),\n",
-       " ('Local', 0.236)]"
       ]
      },
      "execution_count": 7,
@@ -146,11 +156,11 @@
     {
      "data": {
       "text/plain": [
-       "[('Navigational', 0.861),\n",
-       " ('Transactional', 0.725),\n",
-       " ('Local', 0.422),\n",
-       " ('Commercial', 0.287),\n",
-       " ('Informational', 0.202)]"
       ]
      },
      "execution_count": 8,
@@ -170,11 +180,11 @@
     {
      "data": {
       "text/plain": [
-       "[('Navigational', 0.983),\n",
-       " ('Transactional', 0.27),\n",
-       " ('Local', 0.23),\n",
-       " ('Informational', 0.209),\n",
-       " ('Commercial', 0.192)]"
       ]
      },
      "execution_count": 9,
@@ -194,11 +204,11 @@
     {
      "data": {
       "text/plain": [
-       "[('Navigational', 0.983),\n",
        " ('Transactional', 0.256),\n",
-       " ('Informational', 0.241),\n",
-       " ('Local', 0.214),\n",
-       " ('Commercial', 0.184)]"
       ]
      },
      "execution_count": 10,
@@ -218,11 +228,11 @@
     {
      "data": {
       "text/plain": [
-       "[('Local', 0.988),\n",
-       " ('Informational', 0.294),\n",
-       " ('Navigational', 0.284),\n",
-       " ('Commercial', 0.252),\n",
-       " ('Transactional', 0.235)]"
       ]
      },
      "execution_count": 11,
@@ -242,11 +252,11 @@
     {
      "data": {
       "text/plain": [
-       "[('Informational', 0.984),\n",
-       " ('Local', 0.245),\n",
-       " ('Commercial', 0.242),\n",
-       " ('Transactional', 0.226),\n",
-       " ('Navigational', 0.189)]"
       ]
      },
      "execution_count": 12,
@@ -258,6 +268,204 @@
     "get_top_intent(\"how to wear headphones\")"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,

    "cell_type": "code",
    "execution_count": 2,
    "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/home/ubuntu/SentenceStructureComparision/venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+      "  from .autonotebook import tqdm as notebook_tqdm\n",
+      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
+     ]
+    }
+   ],
    "source": [
     "from utils.get_intent import get_top_intent"
    ]
     {
      "data": {
       "text/plain": [
+       "[('Commercial', 0.997),\n",
+       " ('Transactional', 0.199),\n",
+       " ('Local', 0.132),\n",
+       " ('Navigational', 0.099),\n",
+       " ('Informational', 0.088)]"
       ]
      },
      "execution_count": 3,
     {
      "data": {
       "text/plain": [
+       "[('Transactional', 0.996),\n",
+       " ('Commercial', 0.315),\n",
+       " ('Navigational', 0.149),\n",
+       " ('Local', 0.146),\n",
+       " ('Informational', 0.133)]"
       ]
      },
      "execution_count": 4,
     {
      "data": {
       "text/plain": [
+       "[('Informational', 0.999),\n",
+       " ('Transactional', 0.116),\n",
+       " ('Local', 0.094),\n",
+       " ('Commercial', 0.075),\n",
+       " ('Navigational', 0.075)]"
       ]
      },
      "execution_count": 5,
     {
      "data": {
       "text/plain": [
+       "[('Local', 0.997),\n",
+       " ('Commercial', 0.134),\n",
+       " ('Informational', 0.122),\n",
+       " ('Navigational', 0.121),\n",
+       " ('Transactional', 0.12)]"
       ]
      },
      "execution_count": 6,
     {
      "data": {
       "text/plain": [
+       "[('Informational', 0.892),\n",
+       " ('Transactional', 0.685),\n",
+       " ('Navigational', 0.533),\n",
+       " ('Commercial', 0.123),\n",
+       " ('Local', 0.072)]"
       ]
      },
      "execution_count": 7,
     {
      "data": {
       "text/plain": [
+       "[('Informational', 0.993),\n",
+       " ('Commercial', 0.183),\n",
+       " ('Transactional', 0.173),\n",
+       " ('Local', 0.123),\n",
+       " ('Navigational', 0.082)]"
       ]
      },
      "execution_count": 8,
     {
      "data": {
       "text/plain": [
+       "[('Navigational', 0.998),\n",
+       " ('Transactional', 0.271),\n",
+       " ('Local', 0.164),\n",
+       " ('Commercial', 0.134),\n",
+       " ('Informational', 0.129)]"
       ]
      },
      "execution_count": 9,
     {
      "data": {
       "text/plain": [
+       "[('Navigational', 0.998),\n",
        " ('Transactional', 0.256),\n",
+       " ('Local', 0.171),\n",
+       " ('Informational', 0.151),\n",
+       " ('Commercial', 0.127)]"
       ]
      },
      "execution_count": 10,
     {
      "data": {
       "text/plain": [
+       "[('Local', 0.997),\n",
+       " ('Commercial', 0.136),\n",
+       " ('Transactional', 0.124),\n",
+       " ('Informational', 0.119),\n",
+       " ('Navigational', 0.118)]"
       ]
      },
      "execution_count": 11,
     {
      "data": {
       "text/plain": [
+       "[('Informational', 0.999),\n",
+       " ('Transactional', 0.131),\n",
+       " ('Local', 0.09),\n",
+       " ('Commercial', 0.072),\n",
+       " ('Navigational', 0.069)]"
       ]
      },
      "execution_count": 12,
     "get_top_intent(\"how to wear headphones\")"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('Navigational', 0.997),\n",
+       " ('Transactional', 0.452),\n",
+       " ('Local', 0.127),\n",
+       " ('Informational', 0.126),\n",
+       " ('Commercial', 0.12)]"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "get_top_intent(\"receiptify\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('Transactional', 0.995),\n",
+       " ('Commercial', 0.27),\n",
+       " ('Informational', 0.181),\n",
+       " ('Local', 0.162),\n",
+       " ('Navigational', 0.133)]"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "get_top_intent(\"cat ear headphones\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('Transactional', 0.977),\n",
+       " ('Navigational', 0.808),\n",
+       " ('Commercial', 0.254),\n",
+       " ('Informational', 0.107),\n",
+       " ('Local', 0.081)]"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "get_top_intent(\"sony headphones guide\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('Navigational', 0.949),\n",
+       " ('Transactional', 0.89),\n",
+       " ('Informational', 0.328),\n",
+       " ('Commercial', 0.113),\n",
+       " ('Local', 0.069)]"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "get_top_intent(\"wolf cut\") # informational"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('Transactional', 0.996),\n",
+       " ('Commercial', 0.217),\n",
+       " ('Informational', 0.199),\n",
+       " ('Navigational', 0.17),\n",
+       " ('Local', 0.136)]"
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "get_top_intent(\"help plumbing supply\") # informational"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('Informational', 0.969),\n",
+       " ('Commercial', 0.677),\n",
+       " ('Transactional', 0.276),\n",
+       " ('Local', 0.071),\n",
+       " ('Navigational', 0.035)]"
+      ]
+     },
+     "execution_count": 18,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "get_top_intent('yoga purpose') # informational"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os; os.chdir('..')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from utils.get_category import get_top_labels"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('Computers_and_Electronics', 1.0), ('Shopping', 0.182)]"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "get_top_labels(\n",
+    "    \"best cat ear headphones\"\n",
+    ")"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,

research/11_intent_classification_using_distilbert.ipynb CHANGED

@@ -20,7 +20,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
    "metadata": {},
    "outputs": [
     {
@@ -87,7 +87,7 @@
        "4                        tech crunch   Navigational"
       ]
      },
-     "execution_count": 3,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -99,7 +99,59 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -108,7 +160,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -121,7 +173,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
    "metadata": {},
    "outputs": [
     {
@@ -134,7 +186,7 @@
        " 4: 'Transactional'}"
       ]
      },
-     "execution_count": 6,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -145,7 +197,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
    "metadata": {},
    "outputs": [
     {
@@ -158,7 +210,7 @@
        " 'Transactional': 4}"
       ]
      },
-     "execution_count": 7,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -169,7 +221,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -179,7 +231,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
    "metadata": {},
    "outputs": [
     {
@@ -246,58 +298,58 @@
        "      <td>...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>1066</th>\n",
-       "      <td>How to make a paper flower?</td>\n",
        "      <td>Informational</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>1067</th>\n",
-       "      <td>Why do some animals camouflage?</td>\n",
        "      <td>Informational</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>1068</th>\n",
-       "      <td>What is the history of ancient civilizations?</td>\n",
        "      <td>Informational</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>1069</th>\n",
-       "      <td>How to make a simple machine?</td>\n",
        "      <td>Informational</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>1070</th>\n",
-       "      <td>Why do we see the phases of the moon?</td>\n",
        "      <td>Informational</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
-       "<p>1071 rows × 3 columns</p>\n",
        "</div>"
       ],
       "text/plain": [
-       "                                            keyword         intent  id\n",
-       "0                              citalopram vs prozac     Commercial   0\n",
-       "1                 who is the oldest football player  Informational   1\n",
-       "2                                t mobile town east   Navigational   2\n",
-       "3                                         starbucks   Navigational   2\n",
-       "4                                       tech crunch   Navigational   2\n",
-       "...                                             ...            ...  ..\n",
-       "1066                    How to make a paper flower?  Informational   1\n",
-       "1067                Why do some animals camouflage?  Informational   1\n",
-       "1068  What is the history of ancient civilizations?  Informational   1\n",
-       "1069                  How to make a simple machine?  Informational   1\n",
-       "1070          Why do we see the phases of the moon?  Informational   1\n",
        "\n",
-       "[1071 rows x 3 columns]"
       ]
      },
-     "execution_count": 9,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -309,7 +361,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
    "metadata": {},
    "outputs": [
     {
@@ -369,53 +421,53 @@
        "      <td>...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>1066</th>\n",
-       "      <td>How to make a paper flower?</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>1067</th>\n",
-       "      <td>Why do some animals camouflage?</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>1068</th>\n",
-       "      <td>What is the history of ancient civilizations?</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>1069</th>\n",
-       "      <td>How to make a simple machine?</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>1070</th>\n",
-       "      <td>Why do we see the phases of the moon?</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
-       "<p>1071 rows × 2 columns</p>\n",
        "</div>"
       ],
       "text/plain": [
-       "                                            keyword  id\n",
-       "0                              citalopram vs prozac   0\n",
-       "1                 who is the oldest football player   1\n",
-       "2                                t mobile town east   2\n",
-       "3                                         starbucks   2\n",
-       "4                                       tech crunch   2\n",
-       "...                                             ...  ..\n",
-       "1066                    How to make a paper flower?   1\n",
-       "1067                Why do some animals camouflage?   1\n",
-       "1068  What is the history of ancient civilizations?   1\n",
-       "1069                  How to make a simple machine?   1\n",
-       "1070          Why do we see the phases of the moon?   1\n",
        "\n",
-       "[1071 rows x 2 columns]"
       ]
      },
-     "execution_count": 10,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -427,7 +479,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
    "metadata": {},
    "outputs": [
     {
@@ -445,14 +497,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_138160/1635098052.py:1: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -486,74 +538,74 @@
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
-       "      <th>706</th>\n",
-       "      <td>Purchase DJ equipment</td>\n",
        "      <td>4</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>24</th>\n",
-       "      <td>best headphones quora</td>\n",
-       "      <td>2</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>727</th>\n",
-       "      <td>Purchase fitness tracker</td>\n",
        "      <td>4</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>17</th>\n",
-       "      <td>facebook</td>\n",
-       "      <td>2</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>808</th>\n",
-       "      <td>Outdoor activities in Lake Tahoe</td>\n",
-       "      <td>3</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>946</th>\n",
-       "      <td>Wine bars in Napa Valley</td>\n",
-       "      <td>3</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>944</th>\n",
-       "      <td>Art installations in Chicago</td>\n",
-       "      <td>3</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>899</th>\n",
-       "      <td>Snowboarding parks in Utah</td>\n",
-       "      <td>3</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>36</th>\n",
-       "      <td>Mission Immpossible</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
-       "      <th>129</th>\n",
-       "      <td>Instagram</td>\n",
-       "      <td>2</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
-       "                                 text  label\n",
-       "706             Purchase DJ equipment      4\n",
-       "24              best headphones quora      2\n",
-       "727          Purchase fitness tracker      4\n",
-       "17                           facebook      2\n",
-       "808  Outdoor activities in Lake Tahoe      3\n",
-       "946          Wine bars in Napa Valley      3\n",
-       "944      Art installations in Chicago      3\n",
-       "899        Snowboarding parks in Utah      3\n",
-       "36                Mission Immpossible      1\n",
-       "129                         Instagram      2"
       ]
      },
-     "execution_count": 12,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -571,7 +623,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 13,
    "metadata": {},
    "outputs": [
     {
@@ -586,12 +638,12 @@
      "data": {
       "text/plain": [
        "Dataset({\n",
-       "    features: ['text', 'label'],\n",
-       "    num_rows: 1071\n",
        "})"
       ]
      },
-     "execution_count": 13,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -603,7 +655,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
    "metadata": {},
    "outputs": [
     {
@@ -611,17 +663,17 @@
       "text/plain": [
        "DatasetDict({\n",
        "    train: Dataset({\n",
-       "        features: ['text', 'label'],\n",
-       "        num_rows: 856\n",
        "    })\n",
        "    test: Dataset({\n",
-       "        features: ['text', 'label'],\n",
-       "        num_rows: 215\n",
        "    })\n",
        "})"
       ]
      },
-     "execution_count": 14,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -633,7 +685,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -644,7 +696,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 16,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -654,15 +706,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 17,
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "Map: 100%|██████████| 856/856 [00:00<00:00, 18779.12 examples/s]\n",
-      "Map: 100%|██████████| 215/215 [00:00<00:00, 27520.84 examples/s]\n"
      ]
     }
    ],
@@ -672,16 +724,16 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 18,
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "2023-10-13 09:10:00.122326: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
       "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
-      "2023-10-13 09:10:01.611782: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
      ]
     }
    ],
@@ -700,7 +752,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 19,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -711,7 +763,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 20,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -726,14 +778,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 21,
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']\n",
       "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
      ]
     }
@@ -748,7 +800,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 22,
    "metadata": {},
    "outputs": [
     {
@@ -764,8 +816,8 @@
        "\n",
        "    <div>\n",
        "      \n",
-       "      <progress value='324' max='324' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
-       "      [324/324 00:39, Epoch 6/6]\n",
        "    </div>\n",
        "    <table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
@@ -780,38 +832,98 @@
        "    <tr>\n",
        "      <td>1</td>\n",
        "      <td>No log</td>\n",
-       "      <td>0.467693</td>\n",
-       "      <td>0.948837</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>2</td>\n",
        "      <td>No log</td>\n",
-       "      <td>0.204288</td>\n",
-       "      <td>0.953488</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>3</td>\n",
        "      <td>No log</td>\n",
-       "      <td>0.164018</td>\n",
-       "      <td>0.967442</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>4</td>\n",
        "      <td>No log</td>\n",
-       "      <td>0.164968</td>\n",
-       "      <td>0.967442</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>5</td>\n",
        "      <td>No log</td>\n",
-       "      <td>0.163977</td>\n",
-       "      <td>0.967442</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>6</td>\n",
        "      <td>No log</td>\n",
-       "      <td>0.165533</td>\n",
-       "      <td>0.967442</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table><p>"
@@ -826,10 +938,10 @@
     {
      "data": {
       "text/plain": [
-       "TrainOutput(global_step=324, training_loss=0.2842947171058184, metrics={'train_runtime': 40.8212, 'train_samples_per_second': 125.817, 'train_steps_per_second': 7.937, 'total_flos': 13032177536640.0, 'train_loss': 0.2842947171058184, 'epoch': 6.0})"
       ]
      },
-     "execution_count": 22,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -840,7 +952,7 @@
     "    learning_rate=2e-5,\n",
     "    per_device_train_batch_size=16,\n",
     "    per_device_eval_batch_size=16,\n",
-    "    num_train_epochs=6,\n",
     "    weight_decay=0.01,\n",
     "    evaluation_strategy=\"epoch\",\n",
     "    save_strategy=\"epoch\",\n",

   },
   {
    "cell_type": "code",
+   "execution_count": 10,
    "metadata": {},
    "outputs": [
     {
        "4                        tech crunch   Navigational"
       ]
      },
+     "execution_count": 10,
      "metadata": {},
      "output_type": "execute_result"
     }
   },
   {
    "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "False    1506\n",
+       "True      202\n",
+       "Name: count, dtype: int64"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "original_df.duplicated().value_counts()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "original_df.drop_duplicates(inplace=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "False    1506\n",
+       "Name: count, dtype: int64"
+      ]
+     },
+     "execution_count": 18,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "original_df.duplicated().value_counts()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "code",
+   "execution_count": 20,
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "code",
+   "execution_count": 21,
    "metadata": {},
    "outputs": [
     {
        " 4: 'Transactional'}"
       ]
      },
+     "execution_count": 21,
      "metadata": {},
      "output_type": "execute_result"
     }
   },
   {
    "cell_type": "code",
+   "execution_count": 22,
    "metadata": {},
    "outputs": [
     {
        " 'Transactional': 4}"
       ]
      },
+     "execution_count": 22,
      "metadata": {},
      "output_type": "execute_result"
     }
   },
   {
    "cell_type": "code",
+   "execution_count": 23,
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "code",
+   "execution_count": 24,
    "metadata": {},
    "outputs": [
     {
        "      <td>...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1703</th>\n",
+       "      <td>How to make homemade pet accessories from recy...</td>\n",
        "      <td>Informational</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1704</th>\n",
+       "      <td>Top 10 science fiction book series that take r...</td>\n",
        "      <td>Informational</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1705</th>\n",
+       "      <td>How to start a car restoration and customizati...</td>\n",
        "      <td>Informational</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1706</th>\n",
+       "      <td>Ancient Mesopotamian architecture and its infl...</td>\n",
        "      <td>Informational</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1707</th>\n",
+       "      <td>Benefits of a flexitarian diet for those seeki...</td>\n",
        "      <td>Informational</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
+       "<p>1506 rows × 3 columns</p>\n",
        "</div>"
       ],
       "text/plain": [
+       "                                                keyword         intent  id\n",
+       "0                                  citalopram vs prozac     Commercial   0\n",
+       "1                     who is the oldest football player  Informational   1\n",
+       "2                                    t mobile town east   Navigational   2\n",
+       "3                                             starbucks   Navigational   2\n",
+       "4                                           tech crunch   Navigational   2\n",
+       "...                                                 ...            ...  ..\n",
+       "1703  How to make homemade pet accessories from recy...  Informational   1\n",
+       "1704  Top 10 science fiction book series that take r...  Informational   1\n",
+       "1705  How to start a car restoration and customizati...  Informational   1\n",
+       "1706  Ancient Mesopotamian architecture and its infl...  Informational   1\n",
+       "1707  Benefits of a flexitarian diet for those seeki...  Informational   1\n",
        "\n",
+       "[1506 rows x 3 columns]"
       ]
      },
+     "execution_count": 24,
      "metadata": {},
      "output_type": "execute_result"
     }
   },
   {
    "cell_type": "code",
+   "execution_count": 25,
    "metadata": {},
    "outputs": [
     {
        "      <td>...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1703</th>\n",
+       "      <td>How to make homemade pet accessories from recy...</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1704</th>\n",
+       "      <td>Top 10 science fiction book series that take r...</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1705</th>\n",
+       "      <td>How to start a car restoration and customizati...</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1706</th>\n",
+       "      <td>Ancient Mesopotamian architecture and its infl...</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1707</th>\n",
+       "      <td>Benefits of a flexitarian diet for those seeki...</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
+       "<p>1506 rows × 2 columns</p>\n",
        "</div>"
       ],
       "text/plain": [
+       "                                                keyword  id\n",
+       "0                                  citalopram vs prozac   0\n",
+       "1                     who is the oldest football player   1\n",
+       "2                                    t mobile town east   2\n",
+       "3                                             starbucks   2\n",
+       "4                                           tech crunch   2\n",
+       "...                                                 ...  ..\n",
+       "1703  How to make homemade pet accessories from recy...   1\n",
+       "1704  Top 10 science fiction book series that take r...   1\n",
+       "1705  How to start a car restoration and customizati...   1\n",
+       "1706  Ancient Mesopotamian architecture and its infl...   1\n",
+       "1707  Benefits of a flexitarian diet for those seeki...   1\n",
        "\n",
+       "[1506 rows x 2 columns]"
       ]
      },
+     "execution_count": 25,
      "metadata": {},
      "output_type": "execute_result"
     }
   },
   {
    "cell_type": "code",
+   "execution_count": 26,
    "metadata": {},
    "outputs": [
     {
   },
   {
    "cell_type": "code",
+   "execution_count": 27,
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
+      "/tmp/ipykernel_140238/1635098052.py:1: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
+       "      <th>26</th>\n",
+       "      <td>Iphone 13 prices</td>\n",
        "      <td>4</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1604</th>\n",
+       "      <td>Basics of string theory and its applications</td>\n",
+       "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>622</th>\n",
+       "      <td>Purchase air purifier</td>\n",
        "      <td>4</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>841</th>\n",
+       "      <td>Art studios in Asheville</td>\n",
+       "      <td>3</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1504</th>\n",
+       "      <td>What is epigenetic inheritance?</td>\n",
+       "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>311</th>\n",
+       "      <td>Target Business login</td>\n",
+       "      <td>2</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>61</th>\n",
+       "      <td>How to get Spotify Premium</td>\n",
+       "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>980</th>\n",
+       "      <td>How to meditate?</td>\n",
+       "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1428</th>\n",
+       "      <td>Basics of black holes</td>\n",
        "      <td>1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
+       "      <th>1266</th>\n",
+       "      <td>Ancient Chinese dynasties</td>\n",
+       "      <td>1</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "text/plain": [
+       "                                              text  label\n",
+       "26                                Iphone 13 prices      4\n",
+       "1604  Basics of string theory and its applications      1\n",
+       "622                          Purchase air purifier      4\n",
+       "841                       Art studios in Asheville      3\n",
+       "1504               What is epigenetic inheritance?      1\n",
+       "311                          Target Business login      2\n",
+       "61                      How to get Spotify Premium      1\n",
+       "980                               How to meditate?      1\n",
+       "1428                         Basics of black holes      1\n",
+       "1266                     Ancient Chinese dynasties      1"
       ]
      },
+     "execution_count": 27,
      "metadata": {},
      "output_type": "execute_result"
     }
   },
   {
    "cell_type": "code",
+   "execution_count": 28,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "Dataset({\n",
+       "    features: ['text', 'label', '__index_level_0__'],\n",
+       "    num_rows: 1506\n",
        "})"
       ]
      },
+     "execution_count": 28,
      "metadata": {},
      "output_type": "execute_result"
     }
   },
   {
    "cell_type": "code",
+   "execution_count": 29,
    "metadata": {},
    "outputs": [
     {
       "text/plain": [
        "DatasetDict({\n",
        "    train: Dataset({\n",
+       "        features: ['text', 'label', '__index_level_0__'],\n",
+       "        num_rows: 1204\n",
        "    })\n",
        "    test: Dataset({\n",
+       "        features: ['text', 'label', '__index_level_0__'],\n",
+       "        num_rows: 302\n",
        "    })\n",
        "})"
       ]
      },
+     "execution_count": 29,
      "metadata": {},
      "output_type": "execute_result"
     }
   },
   {
    "cell_type": "code",
+   "execution_count": 30,
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "code",
+   "execution_count": 31,
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "code",
+   "execution_count": 32,
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
+      "Map: 100%|██████████| 1204/1204 [00:00<00:00, 14009.91 examples/s]\n",
+      "Map: 100%|██████████| 302/302 [00:00<00:00, 24935.62 examples/s]\n"
      ]
     }
    ],
   },
   {
    "cell_type": "code",
+   "execution_count": 33,
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
+      "2023-10-13 10:49:11.199157: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
       "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
+      "2023-10-13 10:49:12.962522: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
      ]
     }
    ],
   },
   {
    "cell_type": "code",
+   "execution_count": 34,
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "code",
+   "execution_count": 35,
    "metadata": {},
    "outputs": [],
    "source": [
   },
   {
    "cell_type": "code",
+   "execution_count": 36,
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
+      "Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']\n",
       "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
      ]
     }
   },
   {
    "cell_type": "code",
+   "execution_count": 37,
    "metadata": {},
    "outputs": [
     {
        "\n",
        "    <div>\n",
        "      \n",
+       "      <progress value='1216' max='1216' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+       "      [1216/1216 02:51, Epoch 16/16]\n",
        "    </div>\n",
        "    <table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr>\n",
        "      <td>1</td>\n",
        "      <td>No log</td>\n",
+       "      <td>0.208865</td>\n",
+       "      <td>0.986755</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>2</td>\n",
        "      <td>No log</td>\n",
+       "      <td>0.062759</td>\n",
+       "      <td>0.983444</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>3</td>\n",
        "      <td>No log</td>\n",
+       "      <td>0.065099</td>\n",
+       "      <td>0.986755</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>4</td>\n",
        "      <td>No log</td>\n",
+       "      <td>0.081124</td>\n",
+       "      <td>0.976821</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>5</td>\n",
        "      <td>No log</td>\n",
+       "      <td>0.112577</td>\n",
+       "      <td>0.970199</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td>6</td>\n",
        "      <td>No log</td>\n",
+       "      <td>0.111743</td>\n",
+       "      <td>0.973510</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>7</td>\n",
+       "      <td>0.188300</td>\n",
+       "      <td>0.100201</td>\n",
+       "      <td>0.976821</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>8</td>\n",
+       "      <td>0.188300</td>\n",
+       "      <td>0.116866</td>\n",
+       "      <td>0.973510</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>9</td>\n",
+       "      <td>0.188300</td>\n",
+       "      <td>0.141521</td>\n",
+       "      <td>0.970199</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>10</td>\n",
+       "      <td>0.188300</td>\n",
+       "      <td>0.134409</td>\n",
+       "      <td>0.973510</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>11</td>\n",
+       "      <td>0.188300</td>\n",
+       "      <td>0.134093</td>\n",
+       "      <td>0.973510</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>12</td>\n",
+       "      <td>0.188300</td>\n",
+       "      <td>0.127059</td>\n",
+       "      <td>0.973510</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>13</td>\n",
+       "      <td>0.188300</td>\n",
+       "      <td>0.138748</td>\n",
+       "      <td>0.973510</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>14</td>\n",
+       "      <td>0.018000</td>\n",
+       "      <td>0.137167</td>\n",
+       "      <td>0.973510</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>15</td>\n",
+       "      <td>0.018000</td>\n",
+       "      <td>0.135889</td>\n",
+       "      <td>0.973510</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>16</td>\n",
+       "      <td>0.018000</td>\n",
+       "      <td>0.135796</td>\n",
+       "      <td>0.973510</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table><p>"
     {
      "data": {
       "text/plain": [
+       "TrainOutput(global_step=1216, training_loss=0.08689324734242339, metrics={'train_runtime': 172.7465, 'train_samples_per_second': 111.516, 'train_steps_per_second': 7.039, 'total_flos': 62384098266840.0, 'train_loss': 0.08689324734242339, 'epoch': 16.0})"
       ]
      },
+     "execution_count": 37,
      "metadata": {},
      "output_type": "execute_result"
     }
     "    learning_rate=2e-5,\n",
     "    per_device_train_batch_size=16,\n",
     "    per_device_eval_batch_size=16,\n",
+    "    num_train_epochs=16,\n",
     "    weight_decay=0.01,\n",
     "    evaluation_strategy=\"epoch\",\n",
     "    save_strategy=\"epoch\",\n",

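Review note on the 16-epoch run in the notebook diff above: the per-epoch table suggests that the best checkpoint is reached early and that later epochs mostly add training time. The sketch below copies the rows from that table and picks the epoch with the lowest loss. It assumes the third column is validation loss and the fourth is accuracy, since the table header is not visible in this diff. The names `EVAL_HISTORY`, `best_epoch`, and `epochs_since_best` are illustrative and not part of the notebook.

```python
# Rows copied from the 16-epoch training table in the diff above.
# Assumption: columns are (epoch, validation loss, accuracy).
EVAL_HISTORY = [
    (1, 0.208865, 0.986755),
    (2, 0.062759, 0.983444),
    (3, 0.065099, 0.986755),
    (4, 0.081124, 0.976821),
    (5, 0.112577, 0.970199),
    (6, 0.111743, 0.973510),
    (7, 0.100201, 0.976821),
    (8, 0.116866, 0.973510),
    (9, 0.141521, 0.970199),
    (10, 0.134409, 0.973510),
    (11, 0.134093, 0.973510),
    (12, 0.127059, 0.973510),
    (13, 0.138748, 0.973510),
    (14, 0.137167, 0.973510),
    (15, 0.135889, 0.973510),
    (16, 0.135796, 0.973510),
]


def best_epoch(history):
    """Return the (epoch, loss, accuracy) row with the lowest validation loss."""
    return min(history, key=lambda row: row[1])


def epochs_since_best(history):
    """Count how many epochs ran after the lowest-loss epoch."""
    return history[-1][0] - best_epoch(history)[0]


if __name__ == "__main__":
    epoch, loss, accuracy = best_epoch(EVAL_HISTORY)
    print(f"lowest loss at epoch {epoch}: loss={loss}, accuracy={accuracy}")
    print(f"epochs trained after that point: {epochs_since_best(EVAL_HISTORY)}")
```

Since the notebook already sets `save_strategy="epoch"` and `evaluation_strategy="epoch"`, one possible follow-up is to set `load_best_model_at_end=True` in `TrainingArguments` so the lowest-loss checkpoint is the one kept, rather than the final one.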
research/12_text_analytics_using_azure.ipynb ADDED
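
Review note on this notebook: the `replace_original_text` helper annotates entities with `str.replace`, which rewrites every occurrence of the entity text and leaves a trailing space in the result. A minimal sketch of an offset-based alternative follows. It uses the `offset` and `length` values that the notebook already prints for each entity, and assumes those offsets line up with Python string indices. The `Entity` dataclass and `annotate_by_offset` are hypothetical stand-ins for illustration; the sample values are taken from the printed output for "razor kraken headphones".

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in exposing the attributes the notebook reads."""
    text: str
    category: str
    offset: int
    length: int


def annotate_by_offset(original_text, entities):
    """Insert ' (Category)' right after each entity span, located by offset."""
    annotated = original_text
    # Work right to left so earlier offsets stay valid after each insertion.
    for entity in sorted(entities, key=lambda e: e.offset, reverse=True):
        end = entity.offset + entity.length
        annotated = annotated[:end] + f" ({entity.category})" + annotated[end:]
    return annotated


# Values copied from the notebook's printed output for this query.
sample_entities = [
    Entity("razor kraken", "Organization", 0, 12),
    Entity("headphones", "Product", 13, 10),
]

if __name__ == "__main__":
    print(annotate_by_offset("razor kraken headphones", sample_entities))
```

Processing the spans from the end of the string backwards avoids having to recompute offsets after each insertion, and a repeated word elsewhere in the query is left untouched.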

	@@ -0,0 +1,407 @@

+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ! pip install --upgrade azure-ai-textanalytics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "key = \"<YOUR_AZURE_LANGUAGE_KEY>\"  # placeholder: never commit a real key\n",
+    "endpoint = \"https://new-entity.cognitiveservices.azure.com/\"\n",
+    "# endpoint = \"https://eastus.api.cognitive.microsoft.com/\"\n",
+    "\n",
+    "from azure.ai.textanalytics import TextAnalyticsClient\n",
+    "from azure.core.credentials import AzureKeyCredential\n",
+    "\n",
+    "# Authenticate the client using your key and endpoint \n",
+    "def authenticate_client():\n",
+    "    ta_credential = AzureKeyCredential(key)\n",
+    "    text_analytics_client = TextAnalyticsClient(\n",
+    "            endpoint=endpoint, \n",
+    "            credential=ta_credential)\n",
+    "    return text_analytics_client\n",
+    "\n",
+    "client = authenticate_client()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Named Entities:\n",
+      "\n",
+      "\tText: \t razor kraken \tCategory: \t Organization \tSubCategory: \t None \n",
+      "\tConfidence Score: \t 0.54 \tLength: \t 12 \tOffset: \t 0 \n",
+      "\n",
+      "\tText: \t headphones \tCategory: \t Product \tSubCategory: \t None \n",
+      "\tConfidence Score: \t 0.5 \tLength: \t 10 \tOffset: \t 13 \n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "key = \"<YOUR_AZURE_LANGUAGE_KEY>\"  # placeholder: never commit a real key\n",
+    "endpoint = \"https://entity-collection.cognitiveservices.azure.com/\"\n",
+    "# endpoint = \"https://eastus.api.cognitive.microsoft.com/\"\n",
+    "\n",
+    "from azure.ai.textanalytics import TextAnalyticsClient\n",
+    "from azure.core.credentials import AzureKeyCredential\n",
+    "\n",
+    "# Authenticate the client using your key and endpoint \n",
+    "def authenticate_client():\n",
+    "    ta_credential = AzureKeyCredential(key)\n",
+    "    text_analytics_client = TextAnalyticsClient(\n",
+    "            endpoint=endpoint, \n",
+    "            credential=ta_credential)\n",
+    "    return text_analytics_client\n",
+    "\n",
+    "client = authenticate_client()\n",
+    "\n",
+    "# Example function for recognizing entities from text\n",
+    "def entity_recognition_example(client):\n",
+    "\n",
+    "    try:\n",
+    "        documents = [\"razor kraken headphones\"]\n",
+    "        result = client.recognize_entities(documents = documents)[0]\n",
+    "\n",
+    "        print(\"Named Entities:\\n\")\n",
+    "        for entity in result.entities:\n",
+    "            print(\"\\tText: \\t\", entity.text, \"\\tCategory: \\t\", entity.category, \"\\tSubCategory: \\t\", entity.subcategory,\n",
+    "                    \"\\n\\tConfidence Score: \\t\", round(entity.confidence_score, 2), \"\\tLength: \\t\", entity.length, \"\\tOffset: \\t\", entity.offset, \"\\n\")\n",
+    "\n",
+    "    except Exception as err:\n",
+    "        print(\"Encountered exception. {}\".format(err))\n",
+    "\n",
+    "entity_recognition_example(client)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def replace_original_text(original_text: str) -> str:\n",
+    "    # Append ' (Category) ' after every entity recognized in original_text.\n",
+    "    try:\n",
+    "        result = client.recognize_entities(documents=[original_text])[0]\n",
+    "\n",
+    "        # Insert labels by offset, last entity first, so earlier offsets stay valid.\n",
+    "        # str.replace would tag every occurrence of a repeated entity text again.\n",
+    "        annotated = original_text\n",
+    "        for entity in sorted(result.entities, key=lambda e: e.offset, reverse=True):\n",
+    "            end = entity.offset + entity.length\n",
+    "            annotated = annotated[:end] + f' ({entity.category}) ' + annotated[end:]\n",
+    "        return annotated\n",
+    "\n",
+    "    except Exception as err:\n",
+    "        # On any failure, report it and fall back to the unmodified text.\n",
+    "        print(\"Encountered exception. {}\".format(err))\n",
+    "        return original_text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'best cat ear headphones (Product) '"
+      ]
+     },
+     "execution_count": 26,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "replace_original_text(original_text=\"best cat ear headphones\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Barack Obama (Person)  in the White House (Location) '"
+      ]
+     },
+     "execution_count": 29,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "replace_original_text(\n",
+    "    'Barack Obama in the White House'\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "from azure.core.credentials import AzureKeyCredential\n",
+    "from azure.ai.textanalytics import TextAnalyticsClient\n",
+    "\n",
+    "# Read the key from the environment instead of committing it to the notebook.\n",
+    "# AZURE_LANGUAGE_KEY is an assumed variable name; export it before running.\n",
+    "credential = AzureKeyCredential(os.environ[\"AZURE_LANGUAGE_KEY\"])\n",
+    "text_analytics_client = TextAnalyticsClient(endpoint=\"https://entity-retrieval.cognitiveservices.azure.com/\", credential=credential)\n",
+    "# text_analytics_client = TextAnalyticsClient(endpoint=\"https://ktitji5.eastus.cognitiveservices.azure.com/\", credential=credential)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Get the endpoint for the Language service resource\n",
+    "# ! az cognitiveservices account show --name \"resource-name\" --resource-group \"resource-group-name\" --query \"properties.endpoint\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents = [\n",
+    "    {\"id\": \"1\", \"language\": \"en\", \"text\": \"I hated the movie. It was so slow!\"},\n",
+    "    {\"id\": \"2\", \"language\": \"en\", \"text\": \"The movie made it into my top ten favorites. What a great movie!\"},\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "ClientAuthenticationError",
+     "evalue": "(401) Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.\nCode: 401\nMessage: Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mClientAuthenticationError\u001b[0m                 Traceback (most recent call last)",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_text_analytics_client.py:991\u001b[0m, in \u001b[0;36mTextAnalyticsClient.analyze_sentiment\u001b[0;34m(self, documents, **kwargs)\u001b[0m\n\u001b[1;32m    988\u001b[0m     models \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_client\u001b[39m.\u001b[39mmodels(api_version\u001b[39m=\u001b[39m\u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_api_version)\n\u001b[1;32m    989\u001b[0m     \u001b[39mreturn\u001b[39;00m cast(\n\u001b[1;32m    990\u001b[0m         List[Union[AnalyzeSentimentResult, DocumentError]],\n\u001b[0;32m--> 991\u001b[0m         \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_client\u001b[39m.\u001b[39;49manalyze_text(\n\u001b[1;32m    992\u001b[0m             body\u001b[39m=\u001b[39;49mmodels\u001b[39m.\u001b[39;49mAnalyzeTextSentimentAnalysisInput(\n\u001b[1;32m    993\u001b[0m                 analysis_input\u001b[39m=\u001b[39;49m{\u001b[39m\"\u001b[39;49m\u001b[39mdocuments\u001b[39;49m\u001b[39m\"\u001b[39;49m: docs},\n\u001b[1;32m    994\u001b[0m                 parameters\u001b[39m=\u001b[39;49mmodels\u001b[39m.\u001b[39;49mSentimentAnalysisTaskParameters(\n\u001b[1;32m    995\u001b[0m                     logging_opt_out\u001b[39m=\u001b[39;49mdisable_service_logs,\n\u001b[1;32m    996\u001b[0m                     model_version\u001b[39m=\u001b[39;49mmodel_version,\n\u001b[1;32m    997\u001b[0m                     string_index_type\u001b[39m=\u001b[39;49mstring_index_type_compatibility(string_index_type),\n\u001b[1;32m    998\u001b[0m                     opinion_mining\u001b[39m=\u001b[39;49mshow_opinion_mining,\n\u001b[1;32m    999\u001b[0m                 )\n\u001b[1;32m   1000\u001b[0m             ),\n\u001b[1;32m   1001\u001b[0m             show_stats\u001b[39m=\u001b[39;49mshow_stats,\n\u001b[1;32m   1002\u001b[0m             
\u001b[39mcls\u001b[39;49m\u001b[39m=\u001b[39;49mkwargs\u001b[39m.\u001b[39;49mpop(\u001b[39m\"\u001b[39;49m\u001b[39mcls\u001b[39;49m\u001b[39m\"\u001b[39;49m, sentiment_result),\n\u001b[1;32m   1003\u001b[0m             \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs\n\u001b[1;32m   1004\u001b[0m         )\n\u001b[1;32m   1005\u001b[0m     )\n\u001b[1;32m   1007\u001b[0m \u001b[39m# api_versions 3.0, 3.1\u001b[39;00m\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_generated/_operations_mixin.py:109\u001b[0m, in \u001b[0;36mTextAnalyticsClientOperationsMixin.analyze_text\u001b[0;34m(self, body, show_stats, **kwargs)\u001b[0m\n\u001b[1;32m    108\u001b[0m mixin_instance\u001b[39m.\u001b[39m_deserialize \u001b[39m=\u001b[39m Deserializer(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_models_dict(api_version))\n\u001b[0;32m--> 109\u001b[0m \u001b[39mreturn\u001b[39;00m mixin_instance\u001b[39m.\u001b[39;49manalyze_text(body, show_stats, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/core/tracing/decorator.py:78\u001b[0m, in \u001b[0;36mdistributed_trace.<locals>.decorator.<locals>.wrapper_use_tracer\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m     77\u001b[0m \u001b[39mif\u001b[39;00m span_impl_type \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[0;32m---> 78\u001b[0m     \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m     80\u001b[0m \u001b[39m# Merge span is parameter is set, but only if no explicit parent are passed\u001b[39;00m\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_generated/v2022_05_01/operations/_text_analytics_client_operations.py:299\u001b[0m, in \u001b[0;36mTextAnalyticsClientOperationsMixin.analyze_text\u001b[0;34m(self, body, show_stats, **kwargs)\u001b[0m\n\u001b[1;32m    298\u001b[0m \u001b[39mif\u001b[39;00m response\u001b[39m.\u001b[39mstatus_code \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m [\u001b[39m200\u001b[39m]:\n\u001b[0;32m--> 299\u001b[0m     map_error(status_code\u001b[39m=\u001b[39;49mresponse\u001b[39m.\u001b[39;49mstatus_code, response\u001b[39m=\u001b[39;49mresponse, error_map\u001b[39m=\u001b[39;49merror_map)\n\u001b[1;32m    300\u001b[0m     error \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_deserialize\u001b[39m.\u001b[39mfailsafe_deserialize(_models\u001b[39m.\u001b[39mErrorResponse, pipeline_response)\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/core/exceptions.py:165\u001b[0m, in \u001b[0;36mmap_error\u001b[0;34m(status_code, response, error_map)\u001b[0m\n\u001b[1;32m    164\u001b[0m error \u001b[39m=\u001b[39m error_type(response\u001b[39m=\u001b[39mresponse)\n\u001b[0;32m--> 165\u001b[0m \u001b[39mraise\u001b[39;00m error\n",
+      "\u001b[0;31mClientAuthenticationError\u001b[0m: (401) Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.\nCode: 401\nMessage: Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.",
+      "\nThe above exception was the direct cause of the following exception:\n",
+      "\u001b[0;31mClientAuthenticationError\u001b[0m                 Traceback (most recent call last)",
+      "\u001b[1;32m/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb Cell 12\u001b[0m line \u001b[0;36m1\n\u001b[0;32m----> <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W4sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0'>1</a>\u001b[0m response \u001b[39m=\u001b[39m text_analytics_client\u001b[39m.\u001b[39;49manalyze_sentiment(documents)\n\u001b[1;32m      <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W4sdnNjb2RlLXJlbW90ZQ%3D%3D?line=1'>2</a>\u001b[0m successful_responses \u001b[39m=\u001b[39m [doc \u001b[39mfor\u001b[39;00m doc \u001b[39min\u001b[39;00m response \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m doc\u001b[39m.\u001b[39mis_error]\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/core/tracing/decorator.py:78\u001b[0m, in \u001b[0;36mdistributed_trace.<locals>.decorator.<locals>.wrapper_use_tracer\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m     76\u001b[0m span_impl_type \u001b[39m=\u001b[39m settings\u001b[39m.\u001b[39mtracing_implementation()\n\u001b[1;32m     77\u001b[0m \u001b[39mif\u001b[39;00m span_impl_type \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[0;32m---> 78\u001b[0m     \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m     80\u001b[0m \u001b[39m# Merge span is parameter is set, but only if no explicit parent are passed\u001b[39;00m\n\u001b[1;32m     81\u001b[0m \u001b[39mif\u001b[39;00m merge_span \u001b[39mand\u001b[39;00m \u001b[39mnot\u001b[39;00m passed_in_parent:\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_validate.py:74\u001b[0m, in \u001b[0;36mvalidate_multiapi_args.<locals>.decorator.<locals>.wrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m     72\u001b[0m \u001b[39m# the latest version is selected, we assume all features supported\u001b[39;00m\n\u001b[1;32m     73\u001b[0m \u001b[39mif\u001b[39;00m selected_api_version \u001b[39m==\u001b[39m VERSIONS_SUPPORTED[\u001b[39m-\u001b[39m\u001b[39m1\u001b[39m]:\n\u001b[0;32m---> 74\u001b[0m     \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m     76\u001b[0m \u001b[39mif\u001b[39;00m version_method_added \u001b[39mand\u001b[39;00m version_method_added \u001b[39m!=\u001b[39m selected_api_version \u001b[39mand\u001b[39;00m \\\n\u001b[1;32m     77\u001b[0m         VERSIONS_SUPPORTED\u001b[39m.\u001b[39mindex(selected_api_version) \u001b[39m<\u001b[39m VERSIONS_SUPPORTED\u001b[39m.\u001b[39mindex(version_method_added):\n\u001b[1;32m     78\u001b[0m     \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m     79\u001b[0m         \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m'\u001b[39m\u001b[39m{\u001b[39;00mclient\u001b[39m.\u001b[39m\u001b[39m__class__\u001b[39m\u001b[39m.\u001b[39m\u001b[39m__name__\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m.\u001b[39m\u001b[39m{\u001b[39;00mfunc\u001b[39m.\u001b[39m\u001b[39m__name__\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m is not available in API version \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m     80\u001b[0m         \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m{\u001b[39;00mselected_api_version\u001b[39m}\u001b[39;00m\u001b[39m. 
Use service API version \u001b[39m\u001b[39m{\u001b[39;00mversion_method_added\u001b[39m}\u001b[39;00m\u001b[39m or newer.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m     81\u001b[0m     )\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_text_analytics_client.py:1022\u001b[0m, in \u001b[0;36mTextAnalyticsClient.analyze_sentiment\u001b[0;34m(self, documents, **kwargs)\u001b[0m\n\u001b[1;32m   1008\u001b[0m     \u001b[39mreturn\u001b[39;00m cast(\n\u001b[1;32m   1009\u001b[0m         List[Union[AnalyzeSentimentResult, DocumentError]],\n\u001b[1;32m   1010\u001b[0m         \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_client\u001b[39m.\u001b[39msentiment(\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m   1019\u001b[0m         )\n\u001b[1;32m   1020\u001b[0m     )\n\u001b[1;32m   1021\u001b[0m \u001b[39mexcept\u001b[39;00m HttpResponseError \u001b[39mas\u001b[39;00m error:\n\u001b[0;32m-> 1022\u001b[0m     \u001b[39mreturn\u001b[39;00m process_http_response_error(error)\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_response_handlers.py:60\u001b[0m, in \u001b[0;36mprocess_http_response_error\u001b[0;34m(error)\u001b[0m\n\u001b[1;32m     58\u001b[0m \u001b[39mif\u001b[39;00m error\u001b[39m.\u001b[39mstatus_code \u001b[39m==\u001b[39m \u001b[39m404\u001b[39m:\n\u001b[1;32m     59\u001b[0m     raise_error \u001b[39m=\u001b[39m ResourceNotFoundError\n\u001b[0;32m---> 60\u001b[0m \u001b[39mraise\u001b[39;00m raise_error(response\u001b[39m=\u001b[39merror\u001b[39m.\u001b[39mresponse, error_format\u001b[39m=\u001b[39mCSODataV4Format) \u001b[39mfrom\u001b[39;00m \u001b[39merror\u001b[39;00m\n",
+      "\u001b[0;31mClientAuthenticationError\u001b[0m: (401) Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.\nCode: 401\nMessage: Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource."
+     ]
+    }
+   ],
+   "source": [
+    "response = text_analytics_client.analyze_sentiment(documents)\n",
+    "successful_responses = [doc for doc in response if not doc.is_error]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "In this sample, we want to find the articles that mention Microsoft to read.\n"
+     ]
+    },
+    {
+     "ename": "ClientAuthenticationError",
+     "evalue": "(401) Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.\nCode: 401\nMessage: Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mClientAuthenticationError\u001b[0m                 Traceback (most recent call last)",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_text_analytics_client.py:900\u001b[0m, in \u001b[0;36mTextAnalyticsClient.extract_key_phrases\u001b[0;34m(self, documents, disable_service_logs, language, model_version, show_stats, **kwargs)\u001b[0m\n\u001b[1;32m    897\u001b[0m     models \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_client\u001b[39m.\u001b[39mmodels(api_version\u001b[39m=\u001b[39m\u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_api_version)\n\u001b[1;32m    898\u001b[0m     \u001b[39mreturn\u001b[39;00m cast(\n\u001b[1;32m    899\u001b[0m         List[Union[ExtractKeyPhrasesResult, DocumentError]],\n\u001b[0;32m--> 900\u001b[0m         \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_client\u001b[39m.\u001b[39;49manalyze_text(\n\u001b[1;32m    901\u001b[0m             body\u001b[39m=\u001b[39;49mmodels\u001b[39m.\u001b[39;49mAnalyzeTextKeyPhraseExtractionInput(\n\u001b[1;32m    902\u001b[0m                 analysis_input\u001b[39m=\u001b[39;49m{\u001b[39m\"\u001b[39;49m\u001b[39mdocuments\u001b[39;49m\u001b[39m\"\u001b[39;49m: docs},\n\u001b[1;32m    903\u001b[0m                 parameters\u001b[39m=\u001b[39;49mmodels\u001b[39m.\u001b[39;49mKeyPhraseTaskParameters(\n\u001b[1;32m    904\u001b[0m                     logging_opt_out\u001b[39m=\u001b[39;49mdisable_service_logs,\n\u001b[1;32m    905\u001b[0m                     model_version\u001b[39m=\u001b[39;49mmodel_version,\n\u001b[1;32m    906\u001b[0m                 )\n\u001b[1;32m    907\u001b[0m             ),\n\u001b[1;32m    908\u001b[0m             show_stats\u001b[39m=\u001b[39;49mshow_stats,\n\u001b[1;32m    909\u001b[0m             \u001b[39mcls\u001b[39;49m\u001b[39m=\u001b[39;49mkwargs\u001b[39m.\u001b[39;49mpop(\u001b[39m\"\u001b[39;49m\u001b[39mcls\u001b[39;49m\u001b[39m\"\u001b[39;49m, key_phrases_result),\n\u001b[1;32m    910\u001b[0m             
\u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs\n\u001b[1;32m    911\u001b[0m         )\n\u001b[1;32m    912\u001b[0m     )\n\u001b[1;32m    914\u001b[0m \u001b[39m# api_versions 3.0, 3.1\u001b[39;00m\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_generated/_operations_mixin.py:111\u001b[0m, in \u001b[0;36mTextAnalyticsClientOperationsMixin.analyze_text\u001b[0;34m(self, body, show_stats, **kwargs)\u001b[0m\n\u001b[1;32m    110\u001b[0m mixin_instance\u001b[39m.\u001b[39m_deserialize \u001b[39m=\u001b[39m Deserializer(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_models_dict(api_version))\n\u001b[0;32m--> 111\u001b[0m \u001b[39mreturn\u001b[39;00m mixin_instance\u001b[39m.\u001b[39;49manalyze_text(body, show_stats, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/core/tracing/decorator.py:78\u001b[0m, in \u001b[0;36mdistributed_trace.<locals>.decorator.<locals>.wrapper_use_tracer\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m     77\u001b[0m \u001b[39mif\u001b[39;00m span_impl_type \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[0;32m---> 78\u001b[0m     \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m     80\u001b[0m \u001b[39m# Merge span is parameter is set, but only if no explicit parent are passed\u001b[39;00m\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_generated/v2023_04_01/operations/_text_analytics_client_operations.py:299\u001b[0m, in \u001b[0;36mTextAnalyticsClientOperationsMixin.analyze_text\u001b[0;34m(self, body, show_stats, **kwargs)\u001b[0m\n\u001b[1;32m    298\u001b[0m \u001b[39mif\u001b[39;00m response\u001b[39m.\u001b[39mstatus_code \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m [\u001b[39m200\u001b[39m]:\n\u001b[0;32m--> 299\u001b[0m     map_error(status_code\u001b[39m=\u001b[39;49mresponse\u001b[39m.\u001b[39;49mstatus_code, response\u001b[39m=\u001b[39;49mresponse, error_map\u001b[39m=\u001b[39;49merror_map)\n\u001b[1;32m    300\u001b[0m     error \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_deserialize\u001b[39m.\u001b[39mfailsafe_deserialize(_models\u001b[39m.\u001b[39mErrorResponse, pipeline_response)\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/core/exceptions.py:165\u001b[0m, in \u001b[0;36mmap_error\u001b[0;34m(status_code, response, error_map)\u001b[0m\n\u001b[1;32m    164\u001b[0m error \u001b[39m=\u001b[39m error_type(response\u001b[39m=\u001b[39mresponse)\n\u001b[0;32m--> 165\u001b[0m \u001b[39mraise\u001b[39;00m error\n",
+      "\u001b[0;31mClientAuthenticationError\u001b[0m: (401) Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.\nCode: 401\nMessage: Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.",
+      "\nThe above exception was the direct cause of the following exception:\n",
+      "\u001b[0;31mClientAuthenticationError\u001b[0m                 Traceback (most recent call last)",
+      "\u001b[1;32m/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb Cell 8\u001b[0m line \u001b[0;36m7\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=65'>66</a>\u001b[0m     \u001b[39mprint\u001b[39m(\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=66'>67</a>\u001b[0m         \u001b[39m\"\u001b[39m\u001b[39mThe articles that mention Microsoft are articles number: \u001b[39m\u001b[39m{}\u001b[39;00m\u001b[39m. Those are the ones I\u001b[39m\u001b[39m'\u001b[39m\u001b[39mm interested in reading.\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m.\u001b[39mformat(\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=67'>68</a>\u001b[0m             \u001b[39m\"\u001b[39m\u001b[39m, \u001b[39m\u001b[39m\"\u001b[39m\u001b[39m.\u001b[39mjoin(articles_that_mention_microsoft)\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=68'>69</a>\u001b[0m         )\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=69'>70</a>\u001b[0m     )\n\u001b[1;32m     <a 
href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=72'>73</a>\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39m__name__\u001b[39m \u001b[39m==\u001b[39m \u001b[39m'\u001b[39m\u001b[39m__main__\u001b[39m\u001b[39m'\u001b[39m:\n\u001b[0;32m---> <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=73'>74</a>\u001b[0m     sample_extract_key_phrases()\n",
+      "\u001b[1;32m/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb Cell 8\u001b[0m line \u001b[0;36m5\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=37'>38</a>\u001b[0m text_analytics_client \u001b[39m=\u001b[39m TextAnalyticsClient(endpoint\u001b[39m=\u001b[39mendpoint, credential\u001b[39m=\u001b[39mAzureKeyCredential(key))\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=38'>39</a>\u001b[0m articles \u001b[39m=\u001b[39m [\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=39'>40</a>\u001b[0m \u001b[39m    \u001b[39m\u001b[39m\"\"\"\u001b[39;00m\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=40'>41</a>\u001b[0m \u001b[39m    Washington, D.C. Autumn in DC is a uniquely beautiful season. 
The leaves fall from the trees\u001b[39;00m\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=51'>52</a>\u001b[0m \u001b[39m    \"\"\"\u001b[39;00m\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=52'>53</a>\u001b[0m ]\n\u001b[0;32m---> <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=54'>55</a>\u001b[0m result \u001b[39m=\u001b[39m text_analytics_client\u001b[39m.\u001b[39;49mextract_key_phrases(articles)\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=55'>56</a>\u001b[0m \u001b[39mfor\u001b[39;00m idx, doc \u001b[39min\u001b[39;00m \u001b[39menumerate\u001b[39m(result):\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2B7b22686f73744e616d65223a22456d62656464696e6773227d/home/ubuntu/SentenceStructureComparision/research/12_text_analytics_using_azure.ipynb#W0sdnNjb2RlLXJlbW90ZQ%3D%3D?line=56'>57</a>\u001b[0m     \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m doc\u001b[39m.\u001b[39mis_error:\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/core/tracing/decorator.py:78\u001b[0m, in \u001b[0;36mdistributed_trace.<locals>.decorator.<locals>.wrapper_use_tracer\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m     76\u001b[0m span_impl_type \u001b[39m=\u001b[39m settings\u001b[39m.\u001b[39mtracing_implementation()\n\u001b[1;32m     77\u001b[0m \u001b[39mif\u001b[39;00m span_impl_type \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[0;32m---> 78\u001b[0m     \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m     80\u001b[0m \u001b[39m# Merge span is parameter is set, but only if no explicit parent are passed\u001b[39;00m\n\u001b[1;32m     81\u001b[0m \u001b[39mif\u001b[39;00m merge_span \u001b[39mand\u001b[39;00m \u001b[39mnot\u001b[39;00m passed_in_parent:\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_validate.py:79\u001b[0m, in \u001b[0;36mvalidate_multiapi_args.<locals>.decorator.<locals>.wrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m     77\u001b[0m \u001b[39m# the latest version is selected, we assume all features supported\u001b[39;00m\n\u001b[1;32m     78\u001b[0m \u001b[39mif\u001b[39;00m selected_api_version \u001b[39m==\u001b[39m VERSIONS_SUPPORTED[\u001b[39m-\u001b[39m\u001b[39m1\u001b[39m]:\n\u001b[0;32m---> 79\u001b[0m     \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m     81\u001b[0m \u001b[39mif\u001b[39;00m version_method_added \u001b[39mand\u001b[39;00m version_method_added \u001b[39m!=\u001b[39m selected_api_version \u001b[39mand\u001b[39;00m \\\n\u001b[1;32m     82\u001b[0m         VERSIONS_SUPPORTED\u001b[39m.\u001b[39mindex(selected_api_version) \u001b[39m<\u001b[39m VERSIONS_SUPPORTED\u001b[39m.\u001b[39mindex(version_method_added):\n\u001b[1;32m     83\u001b[0m     \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m     84\u001b[0m         \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m'\u001b[39m\u001b[39m{\u001b[39;00mclient\u001b[39m.\u001b[39m\u001b[39m__class__\u001b[39m\u001b[39m.\u001b[39m\u001b[39m__name__\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m.\u001b[39m\u001b[39m{\u001b[39;00mfunc\u001b[39m.\u001b[39m\u001b[39m__name__\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m is not available in API version \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m     85\u001b[0m         \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m{\u001b[39;00mselected_api_version\u001b[39m}\u001b[39;00m\u001b[39m. Use service API version \u001b[39m\u001b[39m{\u001b[39;00mversion_method_added\u001b[39m}\u001b[39;00m\u001b[39m or newer.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m     86\u001b[0m     )\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_text_analytics_client.py:927\u001b[0m, in \u001b[0;36mTextAnalyticsClient.extract_key_phrases\u001b[0;34m(self, documents, disable_service_logs, language, model_version, show_stats, **kwargs)\u001b[0m\n\u001b[1;32m    915\u001b[0m     \u001b[39mreturn\u001b[39;00m cast(\n\u001b[1;32m    916\u001b[0m         List[Union[ExtractKeyPhrasesResult, DocumentError]],\n\u001b[1;32m    917\u001b[0m         \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_client\u001b[39m.\u001b[39mkey_phrases(\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    924\u001b[0m         )\n\u001b[1;32m    925\u001b[0m     )\n\u001b[1;32m    926\u001b[0m \u001b[39mexcept\u001b[39;00m HttpResponseError \u001b[39mas\u001b[39;00m error:\n\u001b[0;32m--> 927\u001b[0m     \u001b[39mreturn\u001b[39;00m process_http_response_error(error)\n",
+      "File \u001b[0;32m~/SentenceStructureComparision/venv/lib/python3.10/site-packages/azure/ai/textanalytics/_response_handlers.py:63\u001b[0m, in \u001b[0;36mprocess_http_response_error\u001b[0;34m(error)\u001b[0m\n\u001b[1;32m     61\u001b[0m \u001b[39mif\u001b[39;00m error\u001b[39m.\u001b[39mstatus_code \u001b[39m==\u001b[39m \u001b[39m404\u001b[39m:\n\u001b[1;32m     62\u001b[0m     raise_error \u001b[39m=\u001b[39m ResourceNotFoundError\n\u001b[0;32m---> 63\u001b[0m \u001b[39mraise\u001b[39;00m raise_error(response\u001b[39m=\u001b[39merror\u001b[39m.\u001b[39mresponse, error_format\u001b[39m=\u001b[39mCSODataV4Format) \u001b[39mfrom\u001b[39;00m \u001b[39merror\u001b[39;00m\n",
+      "\u001b[0;31mClientAuthenticationError\u001b[0m: (401) Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.\nCode: 401\nMessage: Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource."
+     ]
+    }
+   ],
+   "source": [
+    "# -------------------------------------------------------------------------\n",
+    "# Copyright (c) Microsoft Corporation. All rights reserved.\n",
+    "# Licensed under the MIT License. See License.txt in the project root for\n",
+    "# license information.\n",
+    "# --------------------------------------------------------------------------\n",
+    "\n",
+    "\"\"\"\n",
+    "FILE: sample_extract_key_phrases.py\n",
+    "\n",
+    "DESCRIPTION:\n",
+    "    This sample demonstrates how to extract key talking points from a batch of documents.\n",
+    "\n",
+    "    In this sample, we want to go over articles and read the ones that mention Microsoft.\n",
+    "    We're going to use the SDK to create a rudimentary search algorithm to find these articles.\n",
+    "\n",
+    "USAGE:\n",
+    "    python sample_extract_key_phrases.py\n",
+    "\n",
+    "    Set the environment variables with your own values before running the sample:\n",
+    "    1) AZURE_LANGUAGE_ENDPOINT - the endpoint to your Language resource.\n",
+    "    2) AZURE_LANGUAGE_KEY - your Language subscription key\n",
+    "\"\"\"\n",
+    "\n",
+    "\n",
+    "def sample_extract_key_phrases() -> None:\n",
+    "    print(\n",
+    "        \"In this sample, we want to find the articles that mention Microsoft to read.\"\n",
+    "    )\n",
+    "    articles_that_mention_microsoft = []\n",
+    "    # [START extract_key_phrases]\n",
+    "    import os\n",
+    "    from azure.core.credentials import AzureKeyCredential\n",
+    "    from azure.ai.textanalytics import TextAnalyticsClient\n",
+    "\n",
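+    "    # Read the endpoint and key from the environment variables named in the\n",
+    "    # docstring above instead of hard-coding credentials in the notebook.\n",
+    "    endpoint = os.environ[\"AZURE_LANGUAGE_ENDPOINT\"]\n",
+    "    key = os.environ[\"AZURE_LANGUAGE_KEY\"]\n",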
+    "\n",
+    "    text_analytics_client = TextAnalyticsClient(endpoint=endpoint, credential=AzureKeyCredential(key))\n",
+    "    articles = [\n",
+    "        \"\"\"\n",
+    "        Washington, D.C. Autumn in DC is a uniquely beautiful season. The leaves fall from the trees\n",
+    "        in a city chock-full of forests, leaving yellow leaves on the ground and a clearer view of the\n",
+    "        blue sky above...\n",
+    "        \"\"\",\n",
+    "        \"\"\"\n",
+    "        Redmond, WA. In the past few days, Microsoft has decided to further postpone the start date of\n",
+    "        its United States workers, due to the pandemic that rages with no end in sight...\n",
+    "        \"\"\",\n",
+    "        \"\"\"\n",
+    "        Redmond, WA. Employees at Microsoft can be excited about the new coffee shop that will open on campus\n",
+    "        once workers no longer have to work remotely...\n",
+    "        \"\"\"\n",
+    "    ]\n",
+    "\n",
+    "    result = text_analytics_client.extract_key_phrases(articles)\n",
+    "    for idx, doc in enumerate(result):\n",
+    "        if not doc.is_error:\n",
+    "            print(\"Key phrases in article #{}: {}\".format(\n",
+    "                idx + 1,\n",
+    "                \", \".join(doc.key_phrases)\n",
+    "            ))\n",
+    "    # [END extract_key_phrases]\n",
+    "            if \"Microsoft\" in doc.key_phrases:\n",
+    "                articles_that_mention_microsoft.append(str(idx + 1))\n",
+    "\n",
+    "    print(\n",
+    "        \"The articles that mention Microsoft are articles number: {}. Those are the ones I'm interested in reading.\".format(\n",
+    "            \", \".join(articles_that_mention_microsoft)\n",
+    "        )\n",
+    "    )\n",
+    "\n",
+    "\n",
+    "if __name__ == '__main__':\n",
+    "    sample_extract_key_phrases()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

research/13_data_categories.ipynb ADDED

The diff for this file is too large to render. See raw diff

utils/__pycache__/get_category.cpython-310.pyc CHANGED

Binary files a/utils/__pycache__/get_category.cpython-310.pyc and b/utils/__pycache__/get_category.cpython-310.pyc differ

utils/__pycache__/get_intent.cpython-310.pyc CHANGED

Binary files a/utils/__pycache__/get_intent.cpython-310.pyc and b/utils/__pycache__/get_intent.cpython-310.pyc differ