sdsam committed on
Commit f7dcb95 · verified · 1 Parent(s): b248419

Delete trec_pointwise.ipynb

Files changed (1)
  1. trec_pointwise.ipynb +0 -964
trec_pointwise.ipynb DELETED
@@ -1,964 +0,0 @@
1
- {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "metadata": {},
6
- "source": [
7
- "# Setup"
8
- ]
9
- },
10
- {
11
- "cell_type": "markdown",
12
- "metadata": {},
13
- "source": [
14
- "First, we will import our dependencies, initialize the GPT-3.5 Turbo model from OpenAI, and configure the settings for the DSPy library."
15
- ]
16
- },
17
- {
18
- "cell_type": "code",
19
- "execution_count": 1,
20
- "metadata": {},
21
- "outputs": [
22
- {
23
- "name": "stdout",
24
- "output_type": "stream",
25
- "text": [
26
- "fatal: destination path 'pointwise_dspy' already exists and is not an empty directory.\n",
27
- "/future/u/sdsam/home/projects/pointwise_dspy\n",
28
- "Already on 'main'\n",
29
- "Your branch is up-to-date with 'origin/main'.\n",
30
- "/future/u/sdsam/home/projects\n"
31
- ]
32
- }
33
- ],
34
- "source": [
35
- "!git clone https://huggingface.co/sdsam/pointwise_dspy\n",
36
- "%cd pointwise_dspy/\n",
37
- "!git checkout main\n",
38
- "%cd ..\n",
39
- "import os\n",
40
- "repo_clone_path = './pointwise_dspy'\n",
41
- "\n",
42
- "# Set up the cache for this notebook\n",
43
- "os.environ[\"DSP_NOTEBOOK_CACHEDIR\"] = repo_clone_path"
44
- ]
45
- },
46
- {
47
- "cell_type": "code",
48
- "execution_count": 2,
49
- "metadata": {},
50
- "outputs": [],
51
- "source": [
52
- "%load_ext autoreload\n",
53
- "%autoreload 2\n",
54
- "\n",
55
- "import sys; sys.path.append('/future/u/okhattab/repos/public/stanfordnlp/dspy')\n",
56
- "import pytrec_eval\n",
57
- "import openai\n",
58
- "import dspy\n",
59
- "import tqdm\n",
60
- "import random\n",
62
- "import math\n",
63
- "import ujson\n",
64
- "import re\n",
65
- "from dspy.evaluate.evaluate import Evaluate\n",
66
- "\n",
67
- "# Initialize GPT-3.5 Turbo model from OpenAI\n",
68
- "turbo = dspy.OpenAI(model='gpt-3.5-turbo-1106', max_tokens=250, model_type='chat')\n",
69
- "\n",
70
- "# Configure settings for dspy library\n",
71
- "dspy.settings.configure(lm=turbo)"
72
- ]
73
- },
74
- {
75
- "cell_type": "markdown",
76
- "metadata": {},
77
- "source": [
78
- "This loads and organizes the query relevance judgments from the TREC 2019 DL track and randomly splits the queries into development and training sets. You can use any dataset you want, but this notebook uses the TREC 2019 DL dataset for demonstration purposes."
79
- ]
80
- },
81
- {
82
- "cell_type": "code",
83
- "execution_count": 3,
84
- "metadata": {},
85
- "outputs": [],
86
- "source": [
87
- "query2qrels = {}\n",
88
- "\n",
89
- "with open('/future/u/sdsam/home/projects/rank_dspy/formatted_qrels/final_updated_merged_ranking.jsonl') as f:\n",
90
- " for line in f:\n",
91
- " example = dspy.Example(ujson.loads(line))\n",
92
- " query2qrels[example.query] = query2qrels.get(example.query, []) + [example]\n",
93
- "\n",
94
- "# There are a total of 43 queries in the dataset.\n",
95
- "# Lower num_dev_queries to hold some queries out for training.\n",
96
- "num_dev_queries = 43\n",
97
- "\n",
98
- "dev_queries = random.sample(list(query2qrels.keys()), num_dev_queries)\n",
99
- "train_queries = list(set(query2qrels.keys()) - set(dev_queries))"
100
- ]
101
- },
102
- {
103
- "cell_type": "markdown",
104
- "metadata": {},
105
- "source": [
106
- "Now, we process the training queries to build a trainset in which each example holds the query, its passages, their relevance scores, and their ColBERT ranks (the ordering of the initial ranked list), and then shuffle the dataset. We repeat the same process for the devset."
107
- ]
108
- },
109
- {
110
- "cell_type": "code",
111
- "execution_count": 4,
112
- "metadata": {},
113
- "outputs": [],
114
- "source": [
115
- "trainset = []\n",
116
- "for query in train_queries:\n",
117
- " passages = [example.passage for example in query2qrels[query]]\n",
118
- " rates_list = [int(example.qrel_score) for example in query2qrels[query]]\n",
119
- " colbert_rank = [int(example.rank) for example in query2qrels[query]]\n",
120
- " \n",
121
- " example = dspy.Example({\n",
122
- " 'query': query,\n",
123
- " 'passages': passages,\n",
124
- " 'rates_list': rates_list,\n",
125
- " 'colbert_rank': colbert_rank\n",
126
- " })\n",
127
- " \n",
128
- " trainset.append(example)\n",
129
- "\n",
130
- "random.Random(0).shuffle(trainset)\n"
131
- ]
132
- },
133
- {
134
- "cell_type": "code",
135
- "execution_count": 5,
136
- "metadata": {},
137
- "outputs": [],
138
- "source": [
139
- "devset = []\n",
140
- "for query in dev_queries:\n",
141
- " passages = [example.passage for example in query2qrels[query]]\n",
142
- " rates_list = [int(example.qrel_score) for example in query2qrels[query]]\n",
143
- " colbert_rank = [int(example.rank) for example in query2qrels[query]]\n",
144
- " \n",
145
- " example = dspy.Example({\n",
146
- " 'query': query,\n",
147
- " 'passages': passages,\n",
148
- " 'rates_list': rates_list,\n",
149
- " 'colbert_rank': colbert_rank\n",
150
- " })\n",
151
- " \n",
152
- " devset.append(example)\n",
153
- "\n",
154
- "random.Random(0).shuffle(devset)"
155
- ]
156
- },
157
- {
158
- "cell_type": "markdown",
159
- "metadata": {},
160
- "source": [
161
- "Here, we specifically assign 'query' and 'passages' as input fields for each example."
162
- ]
163
- },
164
- {
165
- "cell_type": "code",
166
- "execution_count": 6,
167
- "metadata": {},
168
- "outputs": [],
169
- "source": [
170
- "trainset = [example.with_inputs('query', 'passages') for example in trainset]\n",
171
- "devset = [example.with_inputs('query', 'passages') for example in devset]"
172
- ]
173
- },
174
- {
175
- "cell_type": "markdown",
176
- "metadata": {},
177
- "source": [
178
- "# DSPy"
179
- ]
180
- },
181
- {
182
- "cell_type": "markdown",
183
- "metadata": {},
184
- "source": [
185
- "Here we define a helper that extracts the first integer found in a string (shifted down by one) and falls back to 0 if parsing fails. This helps us parse the LM's score output."
186
- ]
187
- },
188
- {
189
- "cell_type": "code",
190
- "execution_count": 7,
191
- "metadata": {},
192
- "outputs": [],
193
- "source": [
194
- "def try_to_int(x: str):\n",
195
- " try: return int(re.findall(r'\\d+', x)[0]) - 1\n",
196
- " except Exception: return 0"
197
- ]
198
- },
199
- {
200
- "cell_type": "markdown",
201
- "metadata": {},
202
- "source": [
203
- "Here we define the module `RatePassage`, which scores passages for a query using a `ChainOfThought` predictor. It truncates the candidate list to at most `depth` passages, scores each one, and sorts them by score, returning the ranking as a `dspy.Prediction`."
204
- ]
205
- },
206
- {
207
- "cell_type": "code",
208
- "execution_count": 8,
209
- "metadata": {},
210
- "outputs": [],
211
- "source": [
212
- "class RatePassage(dspy.Module):\n",
213
- " def __init__(self, depth=10):\n",
214
- " super().__init__()\n",
215
- " self.score_passage = dspy.ChainOfThought(\"query, passages -> score\")\n",
216
- " self.depth = depth\n",
217
- "\n",
218
- " def forward(self, query, passages):\n",
219
- " passages = passages[:self.depth]\n",
220
- " scored_passages = []\n",
221
- "\n",
222
- " for passage in passages:\n",
223
- " score = self.score_passage(query=query, passages=passage).score\n",
224
- " score = try_to_int(score)\n",
225
- "\n",
226
- " scored_passages.append((score, passage))\n",
227
- "\n",
228
- " scored_passages.sort(key=lambda x: x[0], reverse=True)\n",
229
- "\n",
230
- " final_pred = {query: {passage: score for score, passage in scored_passages}}\n",
231
- " return dspy.Prediction(output=final_pred)"
232
- ]
233
- },
234
- {
235
- "cell_type": "markdown",
236
- "metadata": {},
237
- "source": [
238
- "# Evaluation"
239
- ]
240
- },
241
- {
242
- "cell_type": "markdown",
243
- "metadata": {},
244
- "source": [
245
- "This defines a validation function `validate_rating_v4` that evaluates a rating prediction with the NDCG@10 metric (falling back to plain NDCG when the cut is unavailable), comparing the predicted scores against the ground-truth ratings for a given query and its passages."
246
- ]
247
- },
248
- {
249
- "cell_type": "code",
250
- "execution_count": 9,
251
- "metadata": {},
252
- "outputs": [],
253
- "source": [
254
- "def validate_rating_v4(query, output, trace=None):\n",
255
- " output = output.output\n",
256
- "\n",
257
- " rates_list = query.rates_list\n",
258
- " passages = query.passages\n",
259
- "\n",
260
- " gt_dict = {passages[i]: int(rates_list[i]) for i in range(len(passages))}\n",
261
- "\n",
262
- " evaluator = pytrec_eval.RelevanceEvaluator({query.query: gt_dict}, {'ndcg_cut'})\n",
263
- " val = evaluator.evaluate({query.query: output[query.query]})\n",
264
- "\n",
265
- " available_metrics = val[query.query].keys()\n",
266
- "\n",
267
- " if 'ndcg_cut_10' in available_metrics:\n",
268
- " return val[query.query]['ndcg_cut_10']\n",
269
- " elif 'ndcg' in available_metrics:\n",
270
- " return val[query.query]['ndcg']\n",
271
- " else:\n",
272
- " raise KeyError(\"Expected NDCG metric not found in evaluation result.\")\n"
273
- ]
274
- },
275
- {
276
- "cell_type": "markdown",
277
- "metadata": {},
278
- "source": [
279
- "# Zero-shot and Compile"
280
- ]
281
- },
282
- {
283
- "cell_type": "markdown",
284
- "metadata": {},
285
- "source": [
286
- "Now we can instantiate a `RatePassage` model with a specified depth, set up evaluation on the development set (`devset`), and evaluate the model's performance using the `validate_rating_v4` metric!"
287
- ]
288
- },
289
- {
290
- "cell_type": "markdown",
291
- "metadata": {},
292
- "source": [
293
- "First, we will start with a zero-shot test."
294
- ]
295
- },
296
- {
297
- "cell_type": "code",
298
- "execution_count": 10,
299
- "metadata": {},
300
- "outputs": [
301
- {
302
- "name": "stderr",
303
- "output_type": "stream",
304
- "text": [
305
- " 0%| | 0/43 [00:00<?, ?it/s]"
306
- ]
307
- },
308
- {
309
- "name": "stderr",
310
- "output_type": "stream",
311
- "text": [
312
- "Average Metric: 24.274700567599893 / 43 (56.5): 100%|██████████| 43/43 [00:03<00:00, 12.13it/s]"
313
- ]
314
- },
315
- {
316
- "name": "stdout",
317
- "output_type": "stream",
318
- "text": [
319
- "Average Metric: 24.274700567599893 / 43 (56.5%)\n"
320
- ]
321
- },
322
- {
323
- "name": "stderr",
324
- "output_type": "stream",
325
- "text": [
326
- "\n"
327
- ]
328
- },
329
- {
330
- "data": {
331
- "text/plain": [
332
- "56.45"
333
- ]
334
- },
335
- "execution_count": 10,
336
- "metadata": {},
337
- "output_type": "execute_result"
338
- }
339
- ],
340
- "source": [
341
- "rate_passage_model = RatePassage(depth=40)\n",
342
- "\n",
343
- "\n",
344
- "kwargs = dict(num_threads=9, display_progress=True, display_table=0)\n",
345
- "evaluate = Evaluate(devset=devset[:43], metric=validate_rating_v4, **kwargs)\n",
346
- "\n",
347
- "# evaluate zero-shot\n",
348
- "evaluate(rate_passage_model)"
349
- ]
350
- },
351
- {
352
- "cell_type": "markdown",
353
- "metadata": {},
354
- "source": [
355
- "We can inspect the most recent interaction with the GPT-3.5 Turbo model using the `inspect_history` method with a history depth of `n=1`."
356
- ]
357
- },
358
- {
359
- "cell_type": "code",
360
- "execution_count": 14,
361
- "metadata": {},
362
- "outputs": [
363
- {
364
- "name": "stdout",
365
- "output_type": "stream",
366
- "text": [
367
- "\n",
368
- "\n",
369
- "\n",
370
- "\n",
371
- "Given the fields `query`, `passages`, produce the fields `score`.\n",
372
- "\n",
373
- "---\n",
374
- "\n",
375
- "Follow the following format.\n",
376
- "\n",
377
- "Query: ${query}\n",
378
- "\n",
379
- "Passages: ${passages}\n",
380
- "\n",
381
- "Reasoning: Let's think step by step in order to ${produce the score}. We ...\n",
382
- "\n",
383
- "Score: ${score}\n",
384
- "\n",
385
- "---\n",
386
- "\n",
387
- "Query: is cdg airport in main paris\n",
388
- "\n",
389
- "Passages: cdg, Paris Charles de Gaulle Airport, Paris, France weather text for Thu 20th April. The cdg, Paris Charles de Gaulle Airport, Paris, France weather is going to be sunny. cdg, Paris Charles de Gaulle Airport, Paris, France visibility is going to be around 20 km i.e. 12 miles and an atmospheric pressure of 1035 mb.\n",
390
- "\n",
391
- "Reasoning: Let's think step by step in order to\u001b[32m produce the score. We will first need to check if the passages contain the information related to the query, such as the location of CDG airport in Paris. Then, we can assign a score based on the relevance and accuracy of the information provided in the passages.\n",
392
- "\n",
393
- "Score: 8.5\u001b[0m\n",
394
- "\n",
395
- "\n",
396
- "\n"
397
- ]
398
- }
399
- ],
400
- "source": [
401
- "turbo.inspect_history(n=1)"
402
- ]
403
- },
404
- {
405
- "cell_type": "markdown",
406
- "metadata": {},
407
- "source": [
408
- "## Compiled"
409
- ]
410
- },
411
- {
412
- "cell_type": "markdown",
413
- "metadata": {},
414
- "source": [
415
- "Now, we initialize a `SignatureOptimizer` from the `dspy` library, utilizing the `validate_rating_v4` metric, an optimization depth of 2, and the GPT-3.5 Turbo model as the prompt model.\n",
416
- "\n",
417
- "This time, we compile the `RatePassage` model with the optimizer, evaluating on the first 20 examples of the `devset`."
418
- ]
419
- },
420
- {
421
- "cell_type": "code",
422
- "execution_count": 11,
423
- "metadata": {},
424
- "outputs": [
425
- {
426
- "name": "stdout",
427
- "output_type": "stream",
428
- "text": [
429
- "./pointwise_dspy/compiler\n",
430
- "Starting iteration 0/2.\n",
431
- "----------------\n",
432
- "Predictor 0\n",
433
- "i: Given the field `query` and the fields `passages`, calculate the relevancy score for each passage related to the query.\n",
434
- "p: Relevancy scores:\n",
435
- "\n"
436
- ]
437
- },
438
- {
439
- "name": "stderr",
440
- "output_type": "stream",
441
- "text": [
442
- "Average Metric: 8.402444304089686 / 20 (42.0): 100%|██████████| 20/20 [00:02<00:00, 8.99it/s]\n"
443
- ]
444
- },
445
- {
446
- "name": "stdout",
447
- "output_type": "stream",
448
- "text": [
449
- "Average Metric: 8.402444304089686 / 20 (42.0%)\n",
450
- "----------------\n",
451
- "----------------\n",
452
- "Predictor 0\n",
453
- "i: Identify and match the most relevant passage to the query based on semantic similarity and produce a score representing the degree of relevance.\n",
454
- "p: Relevant passage scoring for query:\n",
455
- "\n"
456
- ]
457
- },
458
- {
459
- "name": "stderr",
460
- "output_type": "stream",
461
- "text": [
462
- "Average Metric: 8.62779048156313 / 20 (43.1): 100%|██████████| 20/20 [00:02<00:00, 8.66it/s] \n"
463
- ]
464
- },
465
- {
466
- "name": "stdout",
467
- "output_type": "stream",
468
- "text": [
469
- "Average Metric: 8.62779048156313 / 20 (43.1%)\n",
470
- "----------------\n",
471
- "----------------\n",
472
- "Predictor 0\n",
473
- "i: Feed the language model with the query and passages and generate a score based on the relevance of each passage to the query.\n",
474
- "p: Scoring for query and passages.\n",
475
- "\n"
476
- ]
477
- },
478
- {
479
- "name": "stderr",
480
- "output_type": "stream",
481
- "text": [
482
- "Average Metric: 10.069514300349399 / 20 (50.3): 100%|██████████| 20/20 [00:04<00:00, 4.45it/s]\n"
483
- ]
484
- },
485
- {
486
- "name": "stdout",
487
- "output_type": "stream",
488
- "text": [
489
- "Average Metric: 10.069514300349399 / 20 (50.3%)\n",
490
- "----------------\n",
491
- "----------------\n",
492
- "Predictor 0\n",
493
- "i: Before you start computing the scores, carefully analyze the query and compare it to the passages. Take into consideration both relevance and potential keywords that might influence the score calculation.\n",
494
- "p: Scores for query analysis:\n",
495
- "\n"
496
- ]
497
- },
498
- {
499
- "name": "stderr",
500
- "output_type": "stream",
501
- "text": [
502
- "Average Metric: 11.351022284735192 / 20 (56.8): 100%|██████████| 20/20 [00:02<00:00, 8.22it/s]\n"
503
- ]
504
- },
505
- {
506
- "name": "stdout",
507
- "output_type": "stream",
508
- "text": [
509
- "Average Metric: 11.351022284735192 / 20 (56.8%)\n",
510
- "----------------\n",
511
- "----------------\n",
512
- "Predictor 0\n",
513
- "i: Use the query to compare and evaluate the relevance of each passage.\n",
514
- "p: Relevance_scores_\n",
515
- "\n"
516
- ]
517
- },
518
- {
519
- "name": "stderr",
520
- "output_type": "stream",
521
- "text": [
522
- "Average Metric: 11.48445319834673 / 20 (57.4): 100%|██████████| 20/20 [00:02<00:00, 7.84it/s] \n"
523
- ]
524
- },
525
- {
526
- "name": "stdout",
527
- "output_type": "stream",
528
- "text": [
529
- "Average Metric: 11.48445319834673 / 20 (57.4%)\n",
530
- "----------------\n",
531
- "----------------\n",
532
- "Predictor 0\n",
533
- "i: First, perform an initial keyword search of the query within the passages. Then, compare the search results to the original query to identify relevant passages.\n",
534
- "p: Scored results:\n",
535
- "\n"
536
- ]
537
- },
538
- {
539
- "name": "stderr",
540
- "output_type": "stream",
541
- "text": [
542
- "Average Metric: 11.605294980491518 / 20 (58.0): 100%|██████████| 20/20 [00:02<00:00, 8.20it/s]\n"
543
- ]
544
- },
545
- {
546
- "name": "stdout",
547
- "output_type": "stream",
548
- "text": [
549
- "Average Metric: 11.605294980491518 / 20 (58.0%)\n",
550
- "----------------\n",
551
- "----------------\n",
552
- "Predictor 0\n",
553
- "i: First, identify the most relevant passages to the given query. Then, calculate a score for each passage based on its relevance to the query.\n",
554
- "p: Relevant passages and their scores:\n",
555
- "\n"
556
- ]
557
- },
558
- {
559
- "name": "stderr",
560
- "output_type": "stream",
561
- "text": [
562
- "Average Metric: 10.581135809365799 / 20 (52.9): 100%|██████████| 20/20 [00:02<00:00, 8.08it/s]\n"
563
- ]
564
- },
565
- {
566
- "name": "stdout",
567
- "output_type": "stream",
568
- "text": [
569
- "Average Metric: 10.581135809365799 / 20 (52.9%)\n",
570
- "----------------\n",
571
- "----------------\n",
572
- "Predictor 0\n",
573
- "i: Before generating `score`, carefully analyze each `passage` in relation to the `query` and assign a numeric value based on relevance and overall quality of the `passage`.\n",
574
- "p: Calculation of relevance score for passages in relation to query:\n",
575
- "\n"
576
- ]
577
- },
578
- {
579
- "name": "stderr",
580
- "output_type": "stream",
581
- "text": [
582
- "Average Metric: 12.76273405727381 / 20 (63.8): 100%|██████████| 20/20 [00:02<00:00, 8.03it/s] \n"
583
- ]
584
- },
585
- {
586
- "name": "stdout",
587
- "output_type": "stream",
588
- "text": [
589
- "Average Metric: 12.76273405727381 / 20 (63.8%)\n",
590
- "----------------\n",
591
- "----------------\n",
592
- "Predictor 0\n",
593
- "i: Given the fields `query`, `passages`, produce the fields `score`.\n",
594
- "p: Score:\n",
595
- "\n"
596
- ]
597
- },
598
- {
599
- "name": "stderr",
600
- "output_type": "stream",
601
- "text": [
602
- "Average Metric: 10.615206363813893 / 20 (53.1): 100%|██████████| 20/20 [00:00<00:00, 957.64it/s]\n"
603
- ]
604
- },
605
- {
606
- "name": "stdout",
607
- "output_type": "stream",
608
- "text": [
609
- "Average Metric: 10.615206363813893 / 20 (53.1%)\n",
610
- "----------------\n",
611
- "Updating Predictor 140321429486800 to:\n",
612
- "i: Before generating `score`, carefully analyze each `passage` in relation to the `query` and assign a numeric value based on relevance and overall quality of the `passage`.\n",
613
- "p: Calculation of relevance score for passages in relation to query:\n",
614
- "Full predictor with update: \n",
615
- "Predictor 0\n",
616
- "i: Before generating `score`, carefully analyze each `passage` in relation to the `query` and assign a numeric value based on relevance and overall quality of the `passage`.\n",
617
- "p: Calculation of relevance score for passages in relation to query:\n",
618
- "\n",
619
- "\n",
620
- "\n",
621
- "\n",
622
- "\n",
623
- "You are an experienced instruction optimizer for large language models. I will give some task instructions I've tried, along with their corresponding validation scores. The instructions are arranged in increasing order based on their scores, where higher scores indicate better quality. Your task is to propose a new instruction that will lead a good language model to perform the task even better. Don't be afraid to be creative.\n",
624
- "\n",
625
- "---\n",
626
- "\n",
627
- "Follow the following format.\n",
628
- "\n",
629
- "Attempted Instructions: ${attempted_instructions}\n",
630
- "Proposed Instruction: The improved instructions for the language model\n",
631
- "Proposed Prefix For Output Field: The string at the end of the prompt, which will help the model start solving the task\n",
632
- "\n",
633
- "---\n",
634
- "\n",
635
- "Attempted Instructions:\n",
636
- "[1] «Instruction #1: Given the field `query` and the fields `passages`, calculate the relevancy score for each passage related to the query.»\n",
637
- "[2] «Prefix #1: Relevancy scores:»\n",
638
- "[3] «Resulting Score #1: 42.01»\n",
639
- "[4] «Instruction #2: Identify and match the most relevant passage to the query based on semantic similarity and produce a score representing the degree of relevance.»\n",
640
- "[5] «Prefix #2: Relevant passage scoring for query:»\n",
641
- "[6] «Resulting Score #2: 43.14»\n",
642
- "[7] «Instruction #3: Feed the language model with the query and passages and generate a score based on the relevance of each passage to the query.»\n",
643
- "[8] «Prefix #3: Scoring for query and passages.»\n",
644
- "[9] «Resulting Score #3: 50.35»\n",
645
- "[10] «Instruction #4: First, identify the most relevant passages to the given query. Then, calculate a score for each passage based on its relevance to the query.»\n",
646
- "[11] «Prefix #4: Relevant passages and their scores:»\n",
647
- "[12] «Resulting Score #4: 52.91»\n",
648
- "[13] «Instruction #5: Given the fields `query`, `passages`, produce the fields `score`.»\n",
649
- "[14] «Prefix #5: Score:»\n",
650
- "[15] «Resulting Score #5: 53.08»\n",
651
- "[16] «Instruction #6: Before you start computing the scores, carefully analyze the query and compare it to the passages. Take into consideration both relevance and potential keywords that might influence the score calculation.»\n",
652
- "[17] «Prefix #6: Scores for query analysis:»\n",
653
- "[18] «Resulting Score #6: 56.76»\n",
654
- "[19] «Instruction #7: Use the query to compare and evaluate the relevance of each passage.»\n",
655
- "[20] «Prefix #7: Relevance_scores_»\n",
656
- "[21] «Resulting Score #7: 57.42»\n",
657
- "[22] «Instruction #8: First, perform an initial keyword search of the query within the passages. Then, compare the search results to the original query to identify relevant passages.»\n",
658
- "[23] «Prefix #8: Scored results:»\n",
659
- "[24] «Resulting Score #8: 58.03»\n",
660
- "[25] «Instruction #9: Before generating `score`, carefully analyze each `passage` in relation to the `query` and assign a numeric value based on relevance and overall quality of the `passage`.»\n",
661
- "[26] «Prefix #9: Calculation of relevance score for passages in relation to query:»\n",
662
- "[27] «Resulting Score #9: 63.81»\n",
663
- "Proposed Instruction:\u001b[32m Given the fields `query`, and `passages`, analyze the semantic similarity of each passage to the query and generate a relevancy score for each passage based on the analysis.\n",
664
- "Proposed Prefix For Output Field: Improved relevancy scores:\u001b[0m\u001b[31m \t (and 9 other completions)\u001b[0m\n",
665
- "\n",
666
- "\n",
667
- "\n",
668
- "None\n",
669
- "Starting iteration 1/2.\n",
670
- "----------------\n",
671
- "Predictor 0\n",
672
- "i: Given the fields `query`, and `passages`, analyze the semantic similarity of each passage to the query and generate a relevancy score for each passage based on the analysis.\n",
673
- "p: Improved relevancy scores:\n",
674
- "\n"
675
- ]
676
- },
677
- {
678
- "name": "stderr",
679
- "output_type": "stream",
680
- "text": [
681
- "Average Metric: 8.854505551688998 / 20 (44.3): 100%|██████████| 20/20 [03:06<00:00, 9.32s/it] \n"
682
- ]
683
- },
684
- {
685
- "name": "stdout",
686
- "output_type": "stream",
687
- "text": [
688
- "Average Metric: 8.854505551688998 / 20 (44.3%)\n",
689
- "----------------\n",
690
- "----------------\n",
691
- "Predictor 0\n",
692
- "i: Given the fields 'query' and 'passages', apply a natural language processing model to analyze semantic similarity and contextual relevance in order to calculate and assign a relevancy score to each passage corresponding to the query.\n",
693
- "p: Relevance scores computation for query:\n",
694
- "\n"
695
- ]
696
- },
697
- {
698
- "name": "stderr",
699
- "output_type": "stream",
700
- "text": [
701
- "Average Metric: 9.305273086417204 / 20 (46.5): 100%|██████████| 20/20 [03:12<00:00, 9.60s/it]\n"
702
- ]
703
- },
704
- {
705
- "name": "stdout",
706
- "output_type": "stream",
707
- "text": [
708
- "Average Metric: 9.305273086417204 / 20 (46.5%)\n",
709
- "----------------\n",
710
- "----------------\n",
711
- "Predictor 0\n",
712
- "i: Proposed Instruction: Utilize semantic analysis and keyword extraction to identify the most relevant passages to the given query. Calculate a relevancy score for each passage based on its semantic similarity and keyword matches with the query. Consider the overall quality of the passage when assigning a numeric value to its relevance to the query.\n",
713
- "p: Relevancy scores for query and passages:\n",
714
- "\n"
715
- ]
716
- },
717
- {
718
- "name": "stderr",
719
- "output_type": "stream",
720
- "text": [
721
- "Average Metric: 8.46655847511304 / 20 (42.3): 100%|██████████| 20/20 [03:28<00:00, 10.44s/it] \n"
722
- ]
723
- },
724
- {
725
- "name": "stdout",
726
- "output_type": "stream",
727
- "text": [
728
- "Average Metric: 8.46655847511304 / 20 (42.3%)\n",
729
- "----------------\n",
730
- "----------------\n",
731
- "Predictor 0\n",
732
- "i: Given the fields `query`, `passages`, and `keywords`, utilize a multi-faceted approach that includes semantic similarity, keyword analysis, and contextual relevance to calculate a comprehensive score for each passage in relation to the query.\n",
733
- "p: Comprehensive passage relevance scores based on query analysis and keyword relevance:\n",
734
- "\n"
735
- ]
736
- },
737
- {
738
- "name": "stderr",
739
- "output_type": "stream",
740
- "text": [
741
- "Average Metric: 11.722319635627931 / 20 (58.6): 100%|██████████| 20/20 [03:30<00:00, 10.54s/it]\n"
742
- ]
743
- },
744
- {
745
- "name": "stdout",
746
- "output_type": "stream",
747
- "text": [
748
- "Average Metric: 11.722319635627931 / 20 (58.6%)\n",
749
- "----------------\n",
750
- "----------------\n",
751
- "Predictor 0\n",
752
- "i: Proposed Instruction: Train the model to identify not only the semantic similarity between the query and passages but also the contextual relevance and significance of the information within the passages in relation to the query.\n",
753
- "p: Improved relevancy scores:\n",
754
- "\n"
755
- ]
756
- },
757
- {
758
- "name": "stderr",
759
- "output_type": "stream",
760
- "text": [
761
- "Average Metric: 10.353255566525437 / 20 (51.8): 100%|██████████| 20/20 [03:19<00:00, 9.99s/it]\n"
762
- ]
763
- },
764
- {
765
- "name": "stdout",
766
- "output_type": "stream",
767
- "text": [
768
- "Average Metric: 10.353255566525437 / 20 (51.8%)\n",
769
- "----------------\n",
770
- "----------------\n",
771
- "Predictor 0\n",
772
- "i: Calculate the relevancy score for each passage related to the query by taking into consideration the semantic similarity, potential keywords, and overall quality of the passage.\n",
773
- "p: Enhanced relevancy scores:\n",
774
- "\n"
775
- ]
776
- },
777
- {
778
- "name": "stderr",
779
- "output_type": "stream",
780
- "text": [
781
- "Average Metric: 12.415814266014621 / 20 (62.1): 100%|██████████| 20/20 [03:14<00:00, 9.71s/it]\n"
782
- ]
783
- },
784
- {
785
- "name": "stdout",
786
- "output_type": "stream",
787
- "text": [
788
- "Average Metric: 12.415814266014621 / 20 (62.1%)\n",
789
- "----------------\n",
790
- "----------------\n",
791
- "Predictor 0\n",
792
- "i: Instruction #10: Utilize semantic embedding to analyze the query and passages, then calculate a relevance score for each passage based on semantic similarity and key keyword analysis.\n",
793
- "p: Score and relevance rankings based on semantic analysis.\n",
794
- "\n"
795
- ]
796
- },
797
- {
798
- "name": "stderr",
799
- "output_type": "stream",
800
- "text": [
801
- "Average Metric: 8.25174783595295 / 20 (41.3): 100%|██████████| 20/20 [06:30<00:00, 19.54s/it] \n"
802
- ]
803
- },
804
- {
805
- "name": "stdout",
806
- "output_type": "stream",
807
- "text": [
808
- "Average Metric: 8.25174783595295 / 20 (41.3%)\n",
809
- "----------------\n",
810
- "----------------\n",
811
- "Predictor 0\n",
812
- "i: Proposed Instruction:\n",
813
- "Given the input fields `query` and `passages`, analyze the semantic relevance of each passage to the query, taking into account key contextual information within the query and passages to generate a comprehensive relevance score.\n",
814
- "p: Semantic relevance scores for passages to query:\n",
815
- "\n"
816
- ]
817
- },
818
- {
819
- "name": "stderr",
820
- "output_type": "stream",
821
- "text": [
822
- "Average Metric: 11.49267383394411 / 20 (57.5): 100%|██████████| 20/20 [03:25<00:00, 10.26s/it] \n"
823
- ]
824
- },
825
- {
826
- "name": "stdout",
827
- "output_type": "stream",
828
- "text": [
829
- "Average Metric: 11.49267383394411 / 20 (57.5%)\n",
830
- "----------------\n",
831
- "----------------\n",
832
- "Predictor 0\n",
833
- "i: Instruction #10: Utilize contextual understanding and semantic analysis to assess the relevancy and coherence of each passage with respect to the given query, and generate a score that encapsulates both aspects.\n",
834
- "p: Semantic relevancy assessments:\n",
835
- "\n"
836
- ]
837
- },
838
- {
839
- "name": "stderr",
840
- "output_type": "stream",
841
- "text": [
842
- "Average Metric: 10.245880431686144 / 20 (51.2): 100%|██████████| 20/20 [03:30<00:00, 10.52s/it]\n"
843
- ]
844
- },
845
- {
846
- "name": "stdout",
847
- "output_type": "stream",
848
- "text": [
849
- "Average Metric: 10.245880431686144 / 20 (51.2%)\n",
850
- "----------------\n",
851
- "----------------\n",
852
- "Predictor 0\n",
853
- "i: The improved instructions for the language model could be: \"Examine each passage to scrutinize semantic congruity with the provided query and derive a relevance score for each passage.\n",
854
- "p: Semantic Relevancy:\n",
855
- "\n"
856
- ]
857
- },
858
- {
859
- "name": "stderr",
860
- "output_type": "stream",
861
- "text": [
862
- "Average Metric: 12.53923996601208 / 20 (62.7): 100%|██████████| 20/20 [03:19<00:00, 9.96s/it] "
863
- ]
864
- },
865
- {
866
- "name": "stdout",
867
- "output_type": "stream",
868
- "text": [
869
- "Average Metric: 12.53923996601208 / 20 (62.7%)\n",
870
- "----------------\n",
871
- "Updating Predictor 140321429486800 to:\n",
872
- "i: Before generating `score`, carefully analyze each `passage` in relation to the `query` and assign a numeric value based on relevance and overall quality of the `passage`.\n",
873
- "p: Calculation of relevance score for passages in relation to query:\n",
874
- "Full predictor with update: \n",
875
- "Predictor 0\n",
876
- "i: Before generating `score`, carefully analyze each `passage` in relation to the `query` and assign a numeric value based on relevance and overall quality of the `passage`.\n",
877
- "p: Calculation of relevance score for passages in relation to query:\n",
878
- "\n"
879
- ]
880
- },
881
- {
882
- "name": "stderr",
883
- "output_type": "stream",
884
- "text": [
885
- "\n"
886
- ]
887
- }
888
- ],
889
- "source": [
890
- "from dspy.teleprompt import SignatureOptimizer\n",
891
- "\n",
892
- "\n",
893
- "teleprompter = SignatureOptimizer(metric=validate_rating_v4, depth=2, verbose=True, prompt_model=turbo)\n",
894
- "compiled_prompt_opt = teleprompter.compile(rate_passage_model.deepcopy(), devset=devset[:20], eval_kwargs=kwargs)"
895
- ]
896
- },
897
- {
898
- "cell_type": "markdown",
899
- "metadata": {},
900
- "source": [
901
- "Finally, let's evaluate the compiled program, `compiled_prompt_opt`, using the evaluation settings defined earlier."
902
- ]
903
- },
904
- {
905
- "cell_type": "code",
906
- "execution_count": 12,
907
- "metadata": {},
908
- "outputs": [
909
- {
910
- "name": "stderr",
911
- "output_type": "stream",
912
- "text": [
913
- "Average Metric: 28.321976567062965 / 43 (65.9): 100%|██████████| 43/43 [00:31<00:00, 1.36it/s] "
914
- ]
915
- },
916
- {
917
- "name": "stdout",
918
- "output_type": "stream",
919
- "text": [
920
- "Average Metric: 28.321976567062965 / 43 (65.9%)\n"
921
- ]
922
- },
923
- {
924
- "name": "stderr",
925
- "output_type": "stream",
926
- "text": [
927
- "\n"
928
- ]
929
- }
930
- ],
931
- "source": [
932
- "eval_score = evaluate(compiled_prompt_opt)"
933
- ]
934
- },
935
- {
936
- "cell_type": "code",
937
- "execution_count": null,
938
- "metadata": {},
939
- "outputs": [],
940
- "source": []
941
- }
942
- ],
943
- "metadata": {
944
- "kernelspec": {
945
- "display_name": "base",
946
- "language": "python",
947
- "name": "python3"
948
- },
949
- "language_info": {
950
- "codemirror_mode": {
951
- "name": "ipython",
952
- "version": 3
953
- },
954
- "file_extension": ".py",
955
- "mimetype": "text/x-python",
956
- "name": "python",
957
- "nbconvert_exporter": "python",
958
- "pygments_lexer": "ipython3",
959
- "version": "3.9.16"
960
- }
961
- },
962
- "nbformat": 4,
963
- "nbformat_minor": 2
964
- }