rotemso23 Claude Sonnet 4.6 committed on
Commit cd3c9f3 · 1 Parent(s): 1954903

Add Phase 4: ROUGE evaluation script and Colab notebook

Compares fine-tuned LoRA adapter vs zero-shot baseline on DialogSum test split.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
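
For readers unfamiliar with the metric being compared, here is a minimal sketch of a ROUGE-1 F-measure on whitespace tokens. It is an illustration only: the function name `rouge1_f` is hypothetical, and the sketch skips the stemming and tokenization that the `rouge-score` package used in this commit applies, so its numbers will not match the script's output exactly.

```python
from collections import Counter


def rouge1_f(reference: str, prediction: str) -> float:
    """Unigram-overlap F-measure (simplified ROUGE-1, no stemming)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# A short prediction fully contained in the reference: precision 1.0, recall 2/6.
score = rouge1_f("the cat sat on the mat", "the cat")
```

The evaluation script averages this kind of per-example F-measure over the whole test split, once for the fine-tuned model and once for the baseline.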

Files changed (2)
  1. notebooks/evaluate_colab.ipynb +201 -0
  2. src/evaluate.py +279 -0
notebooks/evaluate_colab.ipynb ADDED
@@ -0,0 +1,201 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "id": "a1000001",
+    "metadata": {},
+    "source": [
+     "# Dialogue Summarizer — Evaluation on Colab T4\n",
+     "\n",
+     "Runs ROUGE evaluation on the DialogSum test split (819 examples).\n",
+     "Compares the fine-tuned LoRA adapter (`rotemso23/dialogsum-phi3-lora`) against the zero-shot baseline.\n",
+     "\n",
+     "**Before running:**\n",
+     "1. Set Runtime → Change runtime type → **T4 GPU**\n",
+     "2. Add your HuggingFace token in the Colab Secrets tab (key icon, name: `HF_TOKEN`)\n",
+     "\n",
+     "**Expected runtime:** ~30–60 minutes (two inference passes over 819 examples)."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "a1000002",
+    "metadata": {},
+    "source": [
+     "## 1. Verify GPU"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "a1000003",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import torch\n",
+     "assert torch.cuda.is_available(), \"No GPU found! Set Runtime → Change runtime type → T4 GPU\"\n",
+     "print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
+     "print(f\"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB\")"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "a1000004",
+    "metadata": {},
+    "source": [
+     "## 2. Install dependencies"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "a1000005",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "# Colab already has torch, so we skip it to avoid version conflicts\n",
+     "!pip install -q \\\n",
+     "    \"datasets>=2.0.0\" \\\n",
+     "    \"transformers>=4.40.0\" \\\n",
+     "    \"peft>=0.19.0\" \\\n",
+     "    \"bitsandbytes>=0.43.0\" \\\n",
+     "    \"accelerate>=0.30.0\" \\\n",
+     "    \"rouge-score==0.1.2\" \\\n",
+     "    \"python-dotenv==1.0.1\"\n",
+     "print(\"Dependencies installed.\")"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "a1000006",
+    "metadata": {},
+    "source": [
+     "## 3. Clone the repo"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "a1000007",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import os\n",
+     "\n",
+     "REPO_URL = \"https://github.com/rotemso23/dialogue-summarizer.git\"\n",
+     "REPO_DIR = \"dialogue-summarizer\"\n",
+     "\n",
+     "if os.path.exists(REPO_DIR):\n",
+     "    !git -C {REPO_DIR} pull\n",
+     "else:\n",
+     "    !git clone {REPO_URL}\n",
+     "\n",
+     "os.chdir(REPO_DIR)\n",
+     "print(f\"Working directory: {os.getcwd()}\")\n",
+     "!ls"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "a1000008",
+    "metadata": {},
+    "source": [
+     "## 4. Set HuggingFace token\n",
+     "\n",
+     "Your token needs **read** permissions (write not required for evaluation).  \n",
+     "Get one at: https://huggingface.co/settings/tokens"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "a1000009",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from google.colab import userdata\n",
+     "import os\n",
+     "\n",
+     "# Option A: read from Colab Secrets (Secrets tab on the left sidebar → add HF_TOKEN)\n",
+     "try:\n",
+     "    os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")\n",
+     "    print(\"HF_TOKEN loaded from Colab Secrets.\")\n",
+     "except Exception:\n",
+     "    # Option B: paste directly (don't commit this)\n",
+     "    os.environ[\"HF_TOKEN\"] = \"hf_xxx_YOUR_TOKEN_HERE\"\n",
+     "    print(\"HF_TOKEN set manually — remember not to commit this notebook with a real token.\")\n",
+     "\n",
+     "# Write to .env so evaluate.py can find it via python-dotenv\n",
+     "with open(\".env\", \"w\") as f:\n",
+     "    f.write(f'HF_TOKEN={os.environ[\"HF_TOKEN\"]}\\n')\n",
+     "print(\"Token written to .env\")"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "a1000010",
+    "metadata": {},
+    "source": [
+     "## 5. Run evaluation\n",
+     "\n",
+     "This runs two full inference passes over the 819-example test split:\n",
+     "1. Fine-tuned model (`rotemso23/dialogsum-phi3-lora`)\n",
+     "2. Zero-shot baseline (same base model, no adapter)\n",
+     "\n",
+     "Results are saved to `evaluation_results.json`."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "a1000011",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "!PYTHONPATH=/content/dialogue-summarizer python src/evaluate.py"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "a1000012",
+    "metadata": {},
+    "source": [
+     "## 6. View results"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "a1000013",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import json\n",
+     "\n",
+     "with open(\"evaluation_results.json\") as f:\n",
+     "    results = json.load(f)\n",
+     "\n",
+     "print(f\"{'Metric':<12} {'Baseline':>10} {'Fine-tuned':>12} {'Delta':>10}\")\n",
+     "print(\"-\" * 52)\n",
+     "for k in [\"rouge1\", \"rouge2\", \"rougeL\"]:\n",
+     "    base_val = results[\"baseline\"][k]\n",
+     "    ft_val = results[\"fine_tuned\"][k]\n",
+     "    delta = ft_val - base_val\n",
+     "    print(f\"{k:<12} {base_val:>10.4f} {ft_val:>12.4f} {delta:>+10.4f}\")"
+    ]
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "Python 3",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "name": "python",
+    "version": "3.10.0"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }
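
The notebook's token cell writes whatever is in `HF_TOKEN` to `.env`, including the `hf_xxx_YOUR_TOKEN_HERE` placeholder when the Colab Secrets lookup fails. A possible guard, sketched below, rejects an empty or placeholder value before writing. The helper name `write_env_token` and the example token string are hypothetical and not part of this commit.

```python
import os
import tempfile

PLACEHOLDER = "hf_xxx_YOUR_TOKEN_HERE"  # the fallback value used in the notebook cell


def write_env_token(token: str, path: str = ".env") -> None:
    """Hypothetical helper: write HF_TOKEN to a dotenv file, rejecting unusable values."""
    if not token or token == PLACEHOLDER:
        raise ValueError("HF_TOKEN is missing or still the placeholder")
    with open(path, "w") as f:
        f.write(f"HF_TOKEN={token}\n")


# Demonstration in a temporary directory so no real .env file is touched.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    write_env_token("hf_example_not_a_real_token", env_path)
    with open(env_path) as f:
        written = f.read()
```

Failing early here would surface a missing secret before model loading starts, rather than as an authentication error partway through the run.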
src/evaluate.py ADDED
@@ -0,0 +1,279 @@
+ """
+ src/evaluate.py — ROUGE evaluation: fine-tuned vs. zero-shot baseline on DialogSum test split.
+
+ Loads the fine-tuned LoRA adapter from HuggingFace Hub and the base model (no adapter),
+ runs greedy inference on the 819-example test split, computes ROUGE-1/2/L, and saves
+ results to evaluation_results.json.
+
+ Run on Colab T4:
+     python src/evaluate.py
+ """
+
+ from __future__ import annotations
+
+ import json
+ from typing import Any
+
+ import torch
+ from datasets import load_dataset
+ from peft import PeftModel
+ from rouge_score import rouge_scorer
+ from tqdm import tqdm
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+ from src.data import DATASET_NAME, INSTRUCTION
+ from src.model import HUB_REPO, MODEL_ID
+
+ # ---------------------------------------------------------------------------
+ # Constants
+ # ---------------------------------------------------------------------------
+
+ BATCH_SIZE = 4
+ MAX_NEW_TOKENS = 128
+ NUM_QUALITATIVE = 5
+ OUTPUT_FILE = "evaluation_results.json"
+
+
+ # ---------------------------------------------------------------------------
+ # Prompt formatting (inference only — user turn, no assistant content)
+ # ---------------------------------------------------------------------------
+
+ def format_inference_prompt(dialogue: str, tokenizer: Any) -> str:
+     """
+     Format a dialogue into an inference prompt (user turn only).
+
+     Uses add_generation_prompt=True so the model continues with the assistant turn.
+     This is the inference-time counterpart of tokenize_and_mask's prompt_text.
+
+     Args:
+         dialogue: Raw conversation string from the dataset.
+         tokenizer: Phi-3 tokenizer with apply_chat_template support.
+
+     Returns:
+         Prompt string ending with the assistant generation trigger token.
+     """
+     messages = [
+         {"role": "user", "content": f"{INSTRUCTION}\n\nConversation:\n{dialogue}"}
+     ]
+     return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+
+ # ---------------------------------------------------------------------------
+ # Model loading helpers
+ # ---------------------------------------------------------------------------
+
+ def _load_tokenizer(model_id: str = MODEL_ID) -> Any:
+     """Load tokenizer with left-padding (required for batched generation)."""
+     tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+     tokenizer.padding_side = "left"
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+     return tokenizer
+
+
+ def _load_base_model(model_id: str = MODEL_ID) -> Any:
+     """Load Phi-3-mini in 4-bit quantization without any LoRA adapter."""
+     bnb_config = BitsAndBytesConfig(
+         load_in_4bit=True,
+         bnb_4bit_quant_type="nf4",
+         bnb_4bit_compute_dtype=torch.float16,
+         bnb_4bit_use_double_quant=True,
+     )
+     model = AutoModelForCausalLM.from_pretrained(
+         model_id,
+         quantization_config=bnb_config,
+         device_map="auto",
+         trust_remote_code=False,
+         dtype=torch.float16,
+     )
+     model.eval()
+     return model
+
+
+ # ---------------------------------------------------------------------------
+ # Inference
+ # ---------------------------------------------------------------------------
+
+ def run_inference(
+     model: Any,
+     tokenizer: Any,
+     dialogues: list[str],
+     batch_size: int = BATCH_SIZE,
+ ) -> list[str]:
+     """
+     Run batched greedy inference on a list of dialogues.
+
+     Formats each dialogue into an inference prompt, tokenizes in batches with
+     left-padding, generates with max_new_tokens=128 and do_sample=False, then
+     strips the prompt prefix from each output to return only the generated summary.
+
+     Args:
+         model: Loaded causal LM (base model or PeftModel).
+         tokenizer: Matching tokenizer with padding_side='left'.
+         dialogues: List of raw dialogue strings.
+         batch_size: Number of examples per forward pass.
+
+     Returns:
+         List of generated summary strings, one per dialogue.
+     """
+     prompts = [format_inference_prompt(d, tokenizer) for d in dialogues]
+     device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+     all_summaries: list[str] = []
+
+     for i in tqdm(range(0, len(prompts), batch_size), desc="Inferring"):
+         batch_prompts = prompts[i : i + batch_size]
+         inputs = tokenizer(
+             batch_prompts,
+             return_tensors="pt",
+             padding=True,
+             truncation=True,
+             max_length=1024,
+         )
+         inputs = {k: v.to(device) for k, v in inputs.items()}
+         input_len = inputs["input_ids"].shape[1]
+
+         with torch.inference_mode():
+             output_ids = model.generate(
+                 **inputs,
+                 max_new_tokens=MAX_NEW_TOKENS,
+                 do_sample=False,
+                 pad_token_id=tokenizer.pad_token_id,
+             )
+
+         for out in output_ids:
+             generated_ids = out[input_len:]
+             summary = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
+             all_summaries.append(summary)
+
+     return all_summaries
+
+
+ # ---------------------------------------------------------------------------
+ # ROUGE scoring
+ # ---------------------------------------------------------------------------
+
+ def compute_rouge(predictions: list[str], references: list[str]) -> dict[str, float]:
+     """
+     Compute average ROUGE-1, ROUGE-2, and ROUGE-L F-scores.
+
+     Args:
+         predictions: Generated summaries (one per test example).
+         references: Ground-truth summaries from the dataset.
+
+     Returns:
+         Dict with keys 'rouge1', 'rouge2', 'rougeL' — mean F-scores in [0, 1].
+     """
+     scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
+     totals: dict[str, float] = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
+
+     for pred, ref in zip(predictions, references):
+         scores = scorer.score(ref, pred)
+         totals["rouge1"] += scores["rouge1"].fmeasure
+         totals["rouge2"] += scores["rouge2"].fmeasure
+         totals["rougeL"] += scores["rougeL"].fmeasure
+
+     n = len(predictions)
+     return {k: v / n for k, v in totals.items()}
+
+
+ # ---------------------------------------------------------------------------
+ # Qualitative display
+ # ---------------------------------------------------------------------------
+
+ def print_qualitative_examples(
+     dialogues: list[str],
+     references: list[str],
+     finetuned_preds: list[str],
+     baseline_preds: list[str],
+     n: int = NUM_QUALITATIVE,
+ ) -> None:
+     """Print n side-by-side examples: dialogue, reference, fine-tuned, baseline."""
+     print("\n" + "=" * 80)
+     print(f"QUALITATIVE EXAMPLES (n={n})")
+     print("=" * 80)
+     for i in range(n):
+         print(f"\n--- Example {i + 1} ---")
+         print(f"[Dialogue]\n{dialogues[i]}\n")
+         print(f"[Reference]\n{references[i]}\n")
+         print(f"[Fine-tuned]\n{finetuned_preds[i]}\n")
+         print(f"[Baseline]\n{baseline_preds[i]}\n")
+         print("-" * 60)
+
+
+ # ---------------------------------------------------------------------------
+ # Main
+ # ---------------------------------------------------------------------------
+
+ def main() -> None:
+     from dotenv import load_dotenv
+
+     load_dotenv()
+
+     print("Loading DialogSum test split...")
+     test_data = load_dataset(DATASET_NAME, split="test")
+     dialogues: list[str] = test_data["dialogue"]
+     references: list[str] = test_data["summary"]
+     print(f"Test examples: {len(dialogues)}")
+
+     tokenizer = _load_tokenizer()
+
+     # --- Fine-tuned model ---
+     print(f"\nLoading fine-tuned model from Hub: {HUB_REPO}")
+     base_model = _load_base_model()
+     finetuned_model = PeftModel.from_pretrained(base_model, HUB_REPO)
+     finetuned_model.eval()
+
+     print("Running fine-tuned inference...")
+     finetuned_preds = run_inference(finetuned_model, tokenizer, dialogues)
+
+     finetuned_rouge = compute_rouge(finetuned_preds, references)
+     print("\nFine-tuned ROUGE scores:")
+     for k, v in finetuned_rouge.items():
+         print(f" {k}: {v:.4f}")
+
+     # Free GPU memory before loading the baseline
+     del finetuned_model
+     del base_model
+     torch.cuda.empty_cache()
+
+     # --- Baseline model (no adapter) ---
+     print(f"\nLoading baseline model (no adapter): {MODEL_ID}")
+     baseline_model = _load_base_model()
+
+     print("Running baseline inference...")
+     baseline_preds = run_inference(baseline_model, tokenizer, dialogues)
+
+     baseline_rouge = compute_rouge(baseline_preds, references)
+     print("\nBaseline ROUGE scores:")
+     for k, v in baseline_rouge.items():
+         print(f" {k}: {v:.4f}")
+
+     del baseline_model
+     torch.cuda.empty_cache()
+
+     # --- Results table ---
+     print("\n" + "=" * 52)
+     print(f"{'Metric':<12} {'Baseline':>10} {'Fine-tuned':>12} {'Delta':>10}")
+     print("-" * 52)
+     for k in ["rouge1", "rouge2", "rougeL"]:
+         base_val = baseline_rouge[k]
+         ft_val = finetuned_rouge[k]
+         delta = ft_val - base_val
+         print(f"{k:<12} {base_val:>10.4f} {ft_val:>12.4f} {delta:>+10.4f}")
+     print("=" * 52)
+
+     # --- Save results ---
+     results = {
+         "fine_tuned": finetuned_rouge,
+         "baseline": baseline_rouge,
+     }
+     with open(OUTPUT_FILE, "w") as f:
+         json.dump(results, f, indent=2)
+     print(f"\nSaved results to {OUTPUT_FILE}")
+
+     # --- Qualitative examples ---
+     print_qualitative_examples(dialogues, references, finetuned_preds, baseline_preds)
+
+
+ if __name__ == "__main__":
+     main()
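
`run_inference` relies on left-padding so that every prompt in a batch ends at the same column, which is why a single `input_len` slice recovers the generated tokens for all rows. The sketch below illustrates that reasoning with plain Python lists. The names `left_pad` and `fake_generate` are hypothetical stand-ins, and `fake_generate` only appends fixed token IDs instead of calling a model.

```python
PAD = 0  # illustrative pad token ID


def left_pad(batch: list[list[int]], pad_id: int = PAD) -> list[list[int]]:
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]


def fake_generate(padded: list[list[int]], new_tokens: list[list[int]]) -> list[list[int]]:
    """Stand-in for model.generate: returns prompt plus continuation per row."""
    return [row + cont for row, cont in zip(padded, new_tokens)]


prompts = [[11, 12, 13], [21]]           # two prompts of different lengths
padded = left_pad(prompts)               # both rows now have length 3
input_len = len(padded[0])               # shared prompt width, as in run_inference

outputs = fake_generate(padded, [[101, 102], [201, 202]])
generated = [row[input_len:] for row in outputs]  # only the new tokens remain
```

With right-padding the pad tokens would sit between the prompt and the continuation, so the same slice would no longer separate prompt from summary cleanly.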