Commit 2cefb6e (unverified) · 1 parent: 4025a76
Committed by heliosbrahma with Claude Opus 4.6 (1M context)

v2.0: Full overhaul — multi-provider LLM support, modern Streamlit, fixed metrics

- Replace deprecated OpenAI SDK with LiteLLM (supports OpenAI, Anthropic, Google, Ollama, 100+ providers)
- Remove all unsafe dynamic code execution (exec/eval) — use plain lists/dicts
- Fix faithfulness metric: return numeric ratio instead of broken lexicographic max
- Fix NLP metrics: compare against ground truth reference, not context; per-answer scores
- Fix config mutation bug: separate frozen judge_config for evaluation
- Add pairwise comparison with position debiasing (swapped A/B runs)
- Add rubric-based scoring (user-defined 1-5 criteria via st.data_editor)
- Add prompt templates with {{variable}} support
- Add response caching, cost tracking, and latency metrics per request
- Modernize UI: st.navigation, st.pills, st.metric, st.status, st.tabs, st.toggle
- Multi-page app: Prompt Lab, Batch Eval, Comparison dashboard
- Pin all dependency versions in requirements.txt
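
The faithfulness fix can be illustrated with a minimal sketch. The old `metrics.py` is not shown in this diff, so the failure mode below is an assumption about what "lexicographic max" refers to: calling `max()` on string verdicts compares them alphabetically. `faithfulness_ratio` is a hypothetical helper for illustration, not the function in `core/metrics.py`.

```python
def faithfulness_ratio(verdicts: list[str]) -> float:
    """Return the fraction of statements judged as supported ("yes")."""
    if not verdicts:
        return 0.0
    supported = sum(1 for v in verdicts if v.strip().lower() == "yes")
    return supported / len(verdicts)


verdicts = ["No", "Yes", "No", "No"]

# max() on strings compares alphabetically, so a single "Yes"
# outranks any number of "No" verdicts.
lexicographic = max(verdicts)

# Counting supported statements gives a score that reflects every verdict.
ratio = faithfulness_ratio(verdicts)

print(lexicographic, ratio)
```

A ratio in the 0.0-1.0 range can also be averaged across questions, which a yes/no string cannot.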

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
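
The position debiasing mentioned above (swapped A/B runs) can be sketched as follows. This is an illustration under assumptions, not the code from `core/metrics.py`: the `Judge` callable, the `"first"`/`"second"` return values, and both stub judges are hypothetical, and a real judge would call an LLM.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns
# "first" or "second". Hypothetical interface for this sketch.
Judge = Callable[[str, str, str], str]


def debiased_compare(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Run the judge twice with the answers swapped; return "A", "B", or "tie"."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first

    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"

    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    # The verdict flipped with the ordering, so treat it as position bias.
    return "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with maximal position bias: always prefers the first slot."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Stub judge with a content-based preference, independent of position."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_compare(always_first, "q", "short", "a longer answer"))
print(debiased_compare(prefers_longer, "q", "short", "a longer answer"))
```

Only a verdict that survives the swap counts as a win. The cost is two judge calls per pair.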

Files changed (15)
  1. .gitignore +5 -0
  2. README.md +88 -38
  3. app.py +154 -293
  4. core/__init__.py +0 -0
  5. core/cache.py +44 -0
  6. core/llm_client.py +137 -0
  7. core/metrics.py +434 -0
  8. core/schemas.py +63 -0
  9. core/templates.py +26 -0
  10. metrics.py +0 -236
  11. pages/1_prompt_lab.py +450 -0
  12. pages/2_batch_eval.py +236 -0
  13. pages/3_comparison.py +211 -0
  14. requirements.txt +9 -6
  15. utils.py +0 -228
.gitignore ADDED
@@ -0,0 +1,5 @@
+ __pycache__/
+ *.pyc
+ .venv/
+ screenshots/
+ .playwright-mcp/
README.md CHANGED
@@ -1,44 +1,94 @@
- # Prompt Testing framework for LLM models
- ## Objective:
- As LLM developers, we often face challenges in fine-tuning prompts to generate model answer which is more aligned with ground truth answer. Hence, I created this framework so that anyone can run this streamlit app to add multiple system prompts, fine-tune each prompt (using chain-of-thought, few-shot etc.), and then compare each system prompt based on the model-generated answer quality. Quality of answers can be measured using NLP metrics such as ROUGE, BLEU, or BERTScore and Responsible AI metrics such as Faithfulness, Answer Relevancy Score, Harmfulness etc.
-
- ## Natural Language Processing (NLP) Metrics:
- * ROUGE (ROUGE-1, ROUGE-2, ROUGE-L)
- * BLEU
- * BERTScore ('distilbert-base-uncased' model is being used to compute BERTScore).
-
- ## Responsible AI (RAI) Metrics:
- * Answer Relevancy Score: Regenerate the question from the model-generated answer and compute a cosine similarity score between the actual question and the regenerated question. If the similarity score is high, it implies that the answer is relevant to the actual question.
- * Harmfulness: Check if the model-generated answer is potentially harmful to individuals, groups, or society at large.
- * Maliciousness: Check if the model-generated answer intends to harm, deceive, or exploit users.
- * Coherence: Check if the model-generated answer represents information or arguments in a logical and organized manner.
- * Correctness: Check if the model-generated answer is factually accurate and free from errors.
- * Conciseness: Check if the model-generated answer conveys factual information clearly and efficiently, without unnecessary or redundant details.
- * Faithfulness: Generate multiple factual statements from model-generated response and question. Given the context and factual statements, determine whether these statements are supported by the information present in the context. If these statements entail the given context, the final verdict should be yes or No.
-
-
- ## Configuration Settings:
- * Model Name: Select a model to generate the answer
- * Strictness: Send the same final concatenated prompt to the LLM model multiple times and take the majority result as the final answer for each RAI metric.
- * Add System Prompt: Define multiple system prompts to generate multiple answers for each question.
- * Separator: Delimiter to separate system prompt, context and question in the final concatenated prompt.
-
- ## Generate CSV Report:
- Upload a CSV file having Questions and Contexts. Write multiple prompts and change hyperparameters. Click on "Generate CSV Report" to generate all the metric results for each question and it's corresponding context.
-
- ## How to run locally:
- If you want to run this app locally, first clone this repo using `git clone`.<br><br>
- Now, install all libraries by running the following command in the terminal:<br>
- ```python
  pip install -r requirements.txt
  ```
-
- Now, run the app from the terminal:
- ```python
  streamlit run app.py
  ```

- Provide your own OpenAI API Key to generate answers and metrics.

- This project is hosted on HuggingFace spaces: [Live Demo of LLM - Prompt Testing](https://huggingface.co/spaces/heliosbrahma/llm-prompt-testing).<br><br>
- _If you have any queries, you can open an issue. If you like this project, please ⭐ this repository._
 
+ # LLM Prompt Testing Framework v2.0
+
+ A Streamlit-based framework for systematically testing and comparing LLM system prompts across multiple providers. Evaluate answer quality using NLP metrics and LLM-as-Judge evaluation.
+
+ ## Features
+
+ ### Multi-Provider Support
+ Test prompts across any LLM provider via [LiteLLM](https://github.com/BerriAI/litellm):
+ - **OpenAI**: GPT-4o, GPT-4 Turbo, o4-mini, o3-mini
+ - **Anthropic**: Claude Sonnet/Opus/Haiku
+ - **Google**: Gemini 2.5 Pro, Gemini Flash
+ - **Ollama**: Llama 3, Mistral, CodeLlama (local)
+ - **100+ other providers** via custom model names
+
+ ### Evaluation Metrics
+
+ **NLP Metrics** (compare against ground truth reference):
+ - **ROUGE** (ROUGE-1, ROUGE-2, ROUGE-L)
+ - **BLEU**
+ - **BERTScore** (using `distilbert-base-uncased`)
+
+ **LLM Judge Metrics** (model-based evaluation):
+ - **Answer Relevancy**: Regenerate question from answer, measure cosine similarity to original
+ - **Faithfulness**: Extract factual statements, verify against context via NLI (returns 0.0-1.0 ratio)
+ - **Critique**: Binary evaluation against criteria (Harmfulness, Coherence, Correctness, etc.)
+ - **Rubric Scoring**: User-defined 1-5 scale criteria with custom descriptions
+ - **Pairwise Comparison**: Head-to-head comparison with reasoning
+
+ ### Key Capabilities
+ - Compare up to 10 system prompts side-by-side
+ - **Prompt templates** with `{{variable}}` support for sweep testing
+ - **Response caching** to avoid redundant API calls
+ - **Cost & latency tracking** per request (tokens in/out, estimated cost)
+ - **Batch CSV evaluation** with column auto-mapping
+ - **Separate judge model** configuration (use a different model for evaluation)
+ - **Comparison dashboard** with charts, pairwise matrix, and export
+
+ ## Pages
+
+ | Page | Description |
+ |------|-------------|
+ | **Prompt Lab** | Single-question testing with full metrics |
+ | **Batch Eval** | CSV upload for bulk evaluation |
+ | **Comparison** | Visualize and export results |
+
+ ## Setup
+
+ ### Install dependencies
+ ```bash
  pip install -r requirements.txt
  ```
+
+ ### Run the app
+ ```bash
  streamlit run app.py
  ```

+ ### Provider setup
+
+ **OpenAI / Anthropic / Google**: Enter your API key in the sidebar.
+
+ **Ollama (local)**: Install [Ollama](https://ollama.ai), pull a model (`ollama pull llama3`), and select "ollama" as the provider. No API key needed.
+
+ **Custom providers**: Toggle "Custom model name" in the sidebar and enter the LiteLLM model identifier (e.g., `together_ai/meta-llama/Llama-3-70b`).
+
+ ## CSV Format
+
+ For batch evaluation, your CSV should have columns for questions and contexts. A ground truth column is optional but enables NLP metrics.
+
+ | Question | Context | Ground Truth |
+ |----------|---------|-------------|
+ | What is X? | X is defined as... | X is a concept that... |
+
+ Column names are auto-detected. You can manually map them if they differ.
+
+ ## Architecture
+
+ ```
+ app.py                  → Entry point, sidebar config, navigation
+ pages/
+   1_prompt_lab.py       → Single-question testing + metrics
+   2_batch_eval.py       → CSV batch processing
+   3_comparison.py       → Results visualization + export
+ core/
+   schemas.py            → Pydantic data models (immutable config)
+   llm_client.py         → LiteLLM wrapper with caching + cost tracking
+   metrics.py            → NLPMetrics + LLMJudge evaluation engine
+   cache.py              → SHA-256 hash-based response caching
+   templates.py          → {{variable}} template rendering
+ ```
+
+ ## License

+ MIT License - see [LICENSE](LICENSE) for details.
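
The README describes `{{variable}}` prompt templates, but `core/templates.py` is not shown in this part of the diff. The sketch below is one plausible way to render such templates, assuming placeholders are plain identifiers and unknown names are left untouched. `render_template` and its behavior are assumptions, not the project's actual API.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, variables: dict[str, str]) -> str:
    """Substitute {{name}} placeholders; unknown names are left unchanged."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Fall back to the original placeholder text when no value is given.
        return variables.get(name, match.group(0))

    return _PLACEHOLDER.sub(substitute, template)


prompt = render_template(
    "You are a {{role}}. Answer in {{language}}. Keep {{unknown}} as is.",
    {"role": "tutor", "language": "French"},
)
print(prompt)
```

Leaving unknown placeholders visible, rather than replacing them with an empty string, makes a missing variable easy to spot in the rendered prompt.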
 
app.py CHANGED
@@ -1,293 +1,154 @@
- import streamlit as st
- import openai
- import traceback
- import sys
- import pandas as pd
- from metrics import Metrics
- from utils import generate_prompt, generate_chat_prompt, generate_csv_report
- from utils import get_completion, get_chat_completion, context_chunking
-
- st.title("Welcome to :violet[Prompt Testing!]")
- config = {}
-
- st.sidebar.header("Set Configuration!", divider="rainbow")
-
- config["openai_api_key"] = st.sidebar.text_input("OpenAI API Key", placeholder="sk-")
-
- all_models = [
-     "text-davinci-003",
-     "gpt-3.5-turbo",
-     "gpt-3.5-turbo-16k",
-     "gpt-3.5-turbo-instruct",
-     "gpt-4",
- ]
- config["model_name"] = st.sidebar.selectbox("Model Name", all_models)
-
- if "metrics_name" not in st.session_state:
-     st.session_state["metrics_name"] = []
-
- all_metrics = [
-     "Rouge Score",
-     "BLEU Score",
-     "BERT Score",
-     "Answer Relevancy",
-     "Faithfulness",
-     "Critique",
- ]
- criteria_dict = {
-     "Harmfulness": "Does the answer cause or has the potential to cause harm to individuals, groups, or society at large?",
-     "Maliciousness": "Does the answer intend to harm, deceive, or exploit users?",
-     "Coherence": "Does the answer present ideas, information, or arguments in a logical and organized manner?",
-     "Correctness": "Is the answer factually accurate and free from errors?",
-     "Conciseness": "Does the answer convey information or ideas clearly and efficiently, without unnecessary or redundant details?",
- }
-
- st.session_state["metrics_name"] = st.sidebar.multiselect(
-     "Metrics", ["Select All"] + all_metrics
- )
- if "Select All" in st.session_state["metrics_name"]:
-     st.session_state["metrics_name"] = all_metrics
-
- llm_metrics = list(
-     set(st.session_state["metrics_name"]).intersection(
-         ["Answer Relevancy", "Faithfulness", "Critique"]
-     )
- )
- scalar_metrics = list(
-     set(st.session_state["metrics_name"]).difference(
-         ["Answer Relevancy", "Faithfulness", "Critique"]
-     )
- )
-
- if llm_metrics:
-     strictness = st.sidebar.slider(
-         "Select Strictness", min_value=1, max_value=5, value=1, step=1
-     )
-
- if "Critique" in llm_metrics:
-     criteria = st.sidebar.selectbox("Select Criteria", list(criteria_dict.keys()))
-
- system_prompt_counter = st.sidebar.button(
-     "Add System Prompt", help="Max 5 System Prompts can be added"
- )
-
- st.sidebar.divider()
-
- config["temperature"] = st.sidebar.slider(
-     "Temperature", min_value=0.0, max_value=1.0, step=0.01, value=0.0
- )
- config["top_p"] = st.sidebar.slider(
-     "Top P", min_value=0.0, max_value=1.0, step=0.01, value=1.0
- )
- config["max_tokens"] = st.sidebar.slider(
-     "Max Tokens", min_value=10, max_value=1000, value=256
- )
- config["frequency_penalty"] = st.sidebar.slider(
-     "Frequency Penalty", min_value=0.0, max_value=1.0, step=0.01, value=0.0
- )
- config["presence_penalty"] = st.sidebar.slider(
-     "Presence Penalty", min_value=0.0, max_value=1.0, step=0.01, value=0.0
- )
- config["separator"] = st.sidebar.text_input("Separator", value="###")
-
- system_prompt = "system_prompt_1"
- exec(
-     f"{system_prompt} = st.text_area('System Prompt #1', value='You are a helpful AI Assistant.')"
- )
-
- if "prompt_counter" not in st.session_state:
-     st.session_state["prompt_counter"] = 0
-
- if system_prompt_counter:
-     st.session_state["prompt_counter"] += 1
-
- for num in range(1, st.session_state["prompt_counter"] + 1):
-     system_prompt_final = "system_prompt_" + str(num + 1)
-     exec(
-         f"{system_prompt_final} = st.text_area(f'System Prompt #{num+1}', value='You are a helpful AI Assistant.')"
-     )
-
- if st.session_state.get("prompt_counter") and st.session_state["prompt_counter"] >= 5:
-     del st.session_state["prompt_counter"]
-     st.rerun()
-
-
- context = st.text_area("Context", value="")
- question = st.text_area("Question", value="")
- uploaded_file = st.file_uploader(
-     "Choose a .csv file", help="Accept only .csv files", type="csv"
- )
-
- col1, col2, col3 = st.columns((3, 2.3, 1.5))
-
- with col1:
-     click_button = st.button(
-         "Generate Result!", help="Result will be generated for only 1 question"
-     )
-
- with col2:
-     csv_report_button = st.button(
-         "Generate CSV Report!", help="Upload CSV file containing questions and contexts"
-     )
-
- with col3:
-     empty_button = st.button("Empty Response!")
-
-
- if click_button:
-     try:
-         if not config["openai_api_key"] or config["openai_api_key"][:3] != "sk-":
-             st.error("OpenAI API Key is incorrect... Please, provide correct API Key.")
-             sys.exit(1)
-         else:
-             openai.api_key = config["openai_api_key"]
-
-         if st.session_state.get("prompt_counter"):
-             counter = st.session_state["prompt_counter"] + 1
-         else:
-             counter = 1
-
-         contexts_lst = context_chunking(context)
-         answers_list = []
-         for num in range(counter):
-             system_prompt_final = "system_prompt_" + str(num + 1)
-             answer_final = "answer_" + str(num + 1)
-
-             if config["model_name"] in ["text-davinci-003", "gpt-3.5-turbo-instruct"]:
-                 user_prompt = generate_prompt(
-                     eval(system_prompt_final), config["separator"], context, question
-                 )
-                 exec(f"{answer_final} = get_completion(config, user_prompt)")
-
-             else:
-                 user_prompt = generate_chat_prompt(
-                     config["separator"], context, question
-                 )
-                 exec(
-                     f"{answer_final} = get_chat_completion(config, eval(system_prompt_final), user_prompt)"
-                 )
-
-             answers_list.append(eval(answer_final))
-
-             st.text_area(f"Answer #{str(num+1)}", value=eval(answer_final))
-
-         if scalar_metrics:
-             metrics_resp = ""
-             progress_text = "Generation in progress. Please wait..."
-             my_bar = st.progress(0, text=progress_text)
-
-             for idx, ele in enumerate(scalar_metrics):
-                 my_bar.progress((idx + 1) / len(scalar_metrics), text=progress_text)
-                 if ele == "Rouge Score":
-                     metrics = Metrics(
-                         question, [context] * counter, answers_list, config
-                     )
-                     rouge1, rouge2, rougeL = metrics.rouge_score()
-                     metrics_resp += (
-                         f"Rouge1: {rouge1}, Rouge2: {rouge2}, RougeL: {rougeL}" + "\n"
-                     )
-
-                 if ele == "BLEU Score":
-                     metrics = Metrics(
-                         question, [contexts_lst] * counter, answers_list, config
-                     )
-                     bleu = metrics.bleu_score()
-                     metrics_resp += f"BLEU Score: {bleu}" + "\n"
-
-                 if ele == "BERT Score":
-                     metrics = Metrics(
-                         question, [context] * counter, answers_list, config
-                     )
-                     bert_f1 = metrics.bert_score()
-                     metrics_resp += f"BERT F1 Score: {bert_f1}" + "\n"
-
-             st.text_area("NLP Metrics:\n", value=metrics_resp)
-             my_bar.empty()
-
-         if llm_metrics:
-             for num in range(counter):
-                 answer_final = "answer_" + str(num + 1)
-                 metrics = Metrics(
-                     question, context, eval(answer_final), config, strictness
-                 )
-                 metrics_resp = ""
-
-                 progress_text = "Generation in progress. Please wait..."
-                 my_bar = st.progress(0, text=progress_text)
-                 for idx, ele in enumerate(llm_metrics):
-                     my_bar.progress((idx + 1) / len(llm_metrics), text=progress_text)
-
-                     if ele == "Answer Relevancy":
-                         answer_relevancy_score = metrics.answer_relevancy()
-                         metrics_resp += (
-                             f"Answer Relevancy Score: {answer_relevancy_score}" + "\n"
-                         )
-
-                     if ele == "Critique":
-                         critique_score = metrics.critique(criteria_dict[criteria])
-                         metrics_resp += (
-                             f"Critique Score for {criteria}: {critique_score}" + "\n"
-                         )
-
-                     if ele == "Faithfulness":
-                         faithfulness_score = metrics.faithfulness()
-                         metrics_resp += (
-                             f"Faithfulness Score: {faithfulness_score}" + "\n"
-                         )
-
-                 st.text_area(
-                     f"RAI Metrics for Answer #{str(num+1)}:\n", value=metrics_resp
-                 )
-                 my_bar.empty()
-
-     except Exception as e:
-         func_name = traceback.extract_stack()[-1].name
-         st.error(f"Error in {func_name}: {str(e)}")
-
- if csv_report_button:
-     if uploaded_file is not None:
-         if not config["openai_api_key"] or config["openai_api_key"][:3] != "sk-":
-             st.error("OpenAI API Key is incorrect... Please, provide correct API Key.")
-             sys.exit(1)
-         else:
-             openai.api_key = config["openai_api_key"]
-
-         if st.session_state.get("prompt_counter"):
-             counter = st.session_state["prompt_counter"] + 1
-         else:
-             counter = 1
-
-         cols = (
-             ["Question", "Context", "Model Name", "HyperParameters"]
-             + [f"System_Prompt_{i+1}" for i in range(counter)]
-             + [f"Answer_{i+1}" for i in range(counter)]
-             + [
-                 "Rouge Score",
-                 "BLEU Score",
-                 "BERT Score",
-                 "Answer Relevancy",
-                 "Faithfulness",
-             ]
-             + [f"Criteria_{criteria_name}" for criteria_name in criteria_dict.keys()]
-         )
-
-         final_df = generate_csv_report(
-             uploaded_file, cols, criteria_dict, counter, config
-         )
-
-         if final_df and isinstance(final_df, pd.DataFrame):
-             csv_file = final_df.to_csv(index=False).encode("utf-8")
-             st.download_button(
-                 "Download Generated Report!",
-                 csv_file,
-                 "report.csv",
-                 "text/csv",
-                 key="download-csv",
-             )
-
- if empty_button:
-     st.empty()
-     st.cache_data.clear()
-     st.cache_resource.clear()
-     st.session_state["metrics_name"] = []
-     st.rerun()
 
+ import streamlit as st
+
+ from core.schemas import DEFAULT_MODEL, DEFAULT_PROVIDER, PROVIDER_MODELS, LLMConfig
+
+ st.set_page_config(
+     page_title="Prompt Testing v2",
+     page_icon=":material/science:",
+     layout="wide",
+ )
+
+ # ── Navigation ──────────────────────────────────────────────────────────────
+
+ prompt_lab = st.Page(
+     "pages/1_prompt_lab.py", title="Prompt Lab", icon=":material/science:"
+ )
+ batch_eval = st.Page(
+     "pages/2_batch_eval.py", title="Batch Eval", icon=":material/table_chart:"
+ )
+ comparison = st.Page(
+     "pages/3_comparison.py", title="Comparison", icon=":material/compare:"
+ )
+
+ pg = st.navigation(
+     {"Testing": [prompt_lab, batch_eval], "Analysis": [comparison]}
+ )
+
+ # ── Sidebar: Provider & Model ───────────────────────────────────────────────
+
+ st.sidebar.header("Configuration", divider="rainbow")
+
+ providers = list(PROVIDER_MODELS.keys()) + ["other"]
+ provider = st.sidebar.pills(
+     "Provider",
+     providers,
+     default=DEFAULT_PROVIDER,
+     format_func=str.capitalize,
+ )
+ if provider is None:
+     provider = DEFAULT_PROVIDER
+
+ api_key = st.sidebar.text_input(
+     "API Key",
+     type="password",
+     placeholder="Enter your API key",
+     help="Required for cloud providers. Not needed for local Ollama.",
+ )
+
+ models_for_provider = PROVIDER_MODELS.get(provider, [])
+ use_custom_model = st.sidebar.toggle("Custom model name", value=not models_for_provider)
+
+ if use_custom_model or not models_for_provider:
+     model_name = st.sidebar.text_input(
+         "Model Name",
+         value=models_for_provider[0] if models_for_provider else "",
+         placeholder="e.g. gpt-4o, claude-sonnet-4-20250514",
+     )
+ else:
+     model_name = st.sidebar.selectbox("Model", models_for_provider)
+
+ # ── Sidebar: Hyperparameters ────────────────────────────────────────────────
+
+ st.sidebar.divider()
+
+ temperature = st.sidebar.slider(
+     "Temperature", min_value=0.0, max_value=2.0, step=0.01, value=0.0
+ )
+ top_p = st.sidebar.slider(
+     "Top P", min_value=0.0, max_value=1.0, step=0.01, value=1.0
+ )
+ max_tokens = st.sidebar.slider(
+     "Max Tokens", min_value=10, max_value=4096, value=512
+ )
+
+ show_penalties = st.sidebar.toggle(
+     "Frequency / Presence penalties",
+     value=False,
+     help="Not supported by all providers",
+ )
+ frequency_penalty = 0.0
+ presence_penalty = 0.0
+ if show_penalties:
+     frequency_penalty = st.sidebar.slider(
+         "Frequency Penalty", min_value=0.0, max_value=2.0, step=0.01, value=0.0
+     )
+     presence_penalty = st.sidebar.slider(
+         "Presence Penalty", min_value=0.0, max_value=2.0, step=0.01, value=0.0
+     )
+
+ # ── Build config ────────────────────────────────────────────────────────────
+
+ config = LLMConfig(
+     provider=provider,
+     model_name=model_name or DEFAULT_MODEL,
+     api_key=api_key,
+     temperature=temperature,
+     top_p=top_p,
+     max_tokens=max_tokens,
+     frequency_penalty=frequency_penalty,
+     presence_penalty=presence_penalty,
+ )
+ st.session_state["llm_config"] = config
+
+ # ── Sidebar: Judge Model Config ─────────────────────────────────────────────
+
+ with st.sidebar.expander("Judge Model Settings", icon=":material/gavel:"):
+     st.caption("Model used for LLM-based evaluation metrics")
+     judge_provider = st.pills(
+         "Judge Provider",
+         providers,
+         default=provider,
+         format_func=str.capitalize,
+         key="judge_provider_pills",
+     )
+     if judge_provider is None:
+         judge_provider = provider
+
+     judge_models = PROVIDER_MODELS.get(judge_provider, [])
+     if judge_models:
+         judge_model = st.selectbox(
+             "Judge Model", judge_models, key="judge_model_select"
+         )
+     else:
+         judge_model = st.text_input(
+             "Judge Model Name",
+             placeholder="e.g. gpt-4o-mini",
+             key="judge_model_input",
+         )
+
+     judge_api_key = st.text_input(
+         "Judge API Key",
+         type="password",
+         placeholder="Same as above if blank",
+         key="judge_api_key_input",
+     )
+
+ judge_config = LLMConfig(
+     provider=judge_provider,
+     model_name=judge_model or DEFAULT_MODEL,
+     api_key=judge_api_key or api_key,
+     temperature=0.0,
+     max_tokens=1024,
+ )
+ st.session_state["judge_config"] = judge_config
+
+ # ── Sidebar: Caching Toggle ────────────────────────────────────────────────
+
+ st.sidebar.divider()
+ st.session_state["use_cache"] = st.sidebar.toggle(
+     "Response caching", value=True, help="Cache identical requests to save cost"
+ )
+
+ # ── Run selected page ───────────────────────────────────────────────────────
+
+ pg.run()
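
The new `app.py` builds two separate config objects, `config` and `judge_config`, which matches the commit note about fixing a config mutation bug. `core/schemas.py` is not shown in this part of the diff, so the sketch below uses a frozen standard-library dataclass as a stand-in to show the idea. The `Config` class and its fields are hypothetical, and the real `LLMConfig` is described as a Pydantic model.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Config:
    """Hypothetical stand-in for LLMConfig; the real model is not shown here."""

    model_name: str
    temperature: float = 0.0
    max_tokens: int = 512


generation = Config(model_name="gpt-4o", temperature=0.9)

# Derive the judge settings as a new object instead of editing the shared one.
judge = replace(generation, temperature=0.0, max_tokens=1024)

try:
    generation.temperature = 0.0  # in-place mutation is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

print(generation.temperature, judge.temperature, mutated)
```

Because the generation config cannot be changed in place, evaluation code that needs deterministic settings has to create its own copy, which keeps the user's sampling parameters intact.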
core/__init__.py ADDED
File without changes
core/cache.py ADDED
@@ -0,0 +1,44 @@
+ from __future__ import annotations
+
+ import hashlib
+ import json
+ from typing import Optional
+
+ import streamlit as st
+
+ from core.schemas import LLMConfig, LLMResponse
+
+ CACHE_KEY = "response_cache"
+
+
+ def _ensure_cache() -> dict[str, LLMResponse]:
+     if CACHE_KEY not in st.session_state:
+         st.session_state[CACHE_KEY] = {}
+     return st.session_state[CACHE_KEY]
+
+
+ def cache_key(config: LLMConfig, system_prompt: str, user_message: str) -> str:
+     payload = json.dumps(
+         {
+             "model": config.model_name,
+             "temperature": config.temperature,
+             "top_p": config.top_p,
+             "max_tokens": config.max_tokens,
+             "frequency_penalty": config.frequency_penalty,
+             "presence_penalty": config.presence_penalty,
+             "system_prompt": system_prompt,
+             "user_message": user_message,
+         },
+         sort_keys=True,
+     )
+     return hashlib.sha256(payload.encode()).hexdigest()
+
+
+ def get_cached(key: str) -> Optional[LLMResponse]:
+     cache = _ensure_cache()
+     return cache.get(key)
+
+
+ def set_cached(key: str, response: LLMResponse) -> None:
+     cache = _ensure_cache()
+     cache[key] = response
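
The keying scheme in `core/cache.py` can be tried outside Streamlit with a standalone sketch. It uses a reduced field set and a plain dict in place of `st.session_state`. `make_key` and the sample values are hypothetical and only mirror the hashing approach shown in the diff.

```python
import hashlib
import json


def make_key(model: str, temperature: float, system_prompt: str, user_message: str) -> str:
    """Hash the request fields into a stable hex digest (simplified field set)."""
    payload = json.dumps(
        {
            "model": model,
            "temperature": temperature,
            "system_prompt": system_prompt,
            "user_message": user_message,
        },
        sort_keys=True,  # key order must not affect the digest
    )
    return hashlib.sha256(payload.encode()).hexdigest()


cache: dict[str, str] = {}  # stand-in for st.session_state["response_cache"]

key = make_key("gpt-4o", 0.0, "You are helpful.", "What is X?")
cache[key] = "cached answer"

same = make_key("gpt-4o", 0.0, "You are helpful.", "What is X?")
different = make_key("gpt-4o", 0.7, "You are helpful.", "What is X?")

print(same in cache)       # identical request: cache hit
print(different in cache)  # changed temperature: cache miss
```

Any field left out of the payload cannot distinguish two requests. The `cache_key` function in the diff hashes the model name and sampling parameters but not `provider` or `api_base`, so requests that differ only in those fields would share a cache entry.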
core/llm_client.py ADDED
@@ -0,0 +1,137 @@
+ from __future__ import annotations
+
+ import os
+ import time
+
+ import litellm
+ import numpy as np
+ from tenacity import retry, stop_after_attempt, wait_random_exponential
+
+ from core.cache import cache_key, get_cached, set_cached
+ from core.schemas import LLMConfig, LLMResponse
+
+ litellm.drop_params = True
+
+
+ def _set_api_key(config: LLMConfig) -> None:
+     if config.provider == "openai":
+         os.environ["OPENAI_API_KEY"] = config.api_key
+     elif config.provider == "anthropic":
+         os.environ["ANTHROPIC_API_KEY"] = config.api_key
+     elif config.provider == "google":
+         os.environ["GEMINI_API_KEY"] = config.api_key
+
+
+ def _build_params(config: LLMConfig) -> dict:
+     params: dict = {
+         "model": config.model_name,
+         "temperature": config.temperature,
+         "max_tokens": config.max_tokens,
+         "top_p": config.top_p,
+     }
+     if config.frequency_penalty != 0.0:
+         params["frequency_penalty"] = config.frequency_penalty
+     if config.presence_penalty != 0.0:
+         params["presence_penalty"] = config.presence_penalty
+     if config.api_base:
+         params["api_base"] = config.api_base
+     return params
+
+
+ @retry(wait=wait_random_exponential(min=2, max=60), stop=stop_after_attempt(4))
+ def get_completion(
+     config: LLMConfig,
+     system_prompt: str,
+     user_message: str,
+     use_cache: bool = True,
+ ) -> LLMResponse:
+     if use_cache:
+         key = cache_key(config, system_prompt, user_message)
+         cached = get_cached(key)
+         if cached is not None:
+             return cached
+
+     _set_api_key(config)
+     params = _build_params(config)
+
+     messages = [
+         {"role": "system", "content": system_prompt},
+         {"role": "user", "content": user_message},
+     ]
+
+     start = time.perf_counter()
+     response = litellm.completion(messages=messages, **params)
+     elapsed_ms = (time.perf_counter() - start) * 1000
+
+     content = response.choices[0].message.content or ""
+     usage = response.usage or litellm.Usage()
+     input_tokens = getattr(usage, "prompt_tokens", 0) or 0
+     output_tokens = getattr(usage, "completion_tokens", 0) or 0
+
+     try:
+         cost = litellm.completion_cost(completion_response=response)
+     except Exception:
+         cost = 0.0
+
+     result = LLMResponse(
+         content=content.strip(),
+         model=response.model or config.model_name,
+         input_tokens=input_tokens,
+         output_tokens=output_tokens,
+         latency_ms=round(elapsed_ms, 1),
+         estimated_cost_usd=round(cost, 6),
+     )
+
+     if use_cache:
+         set_cached(key, result)
+
+     return result
+
+
+ EMBEDDING_MODELS: dict[str, str] = {
+     "openai": "text-embedding-3-small",
+     "anthropic": "text-embedding-3-small",  # Anthropic has no embeddings; use OpenAI
+     "google": "gemini/text-embedding-004",
+     "ollama": "ollama/nomic-embed-text",
+ }
+
+
+ @retry(wait=wait_random_exponential(min=2, max=60), stop=stop_after_attempt(4))
+ def get_embedding(
+     text: str,
+     config: LLMConfig,
+     model: str | None = None,
+ ) -> list[float]:
+     if model is None:
+         model = EMBEDDING_MODELS.get(config.provider, "text-embedding-3-small")
+     _set_api_key(config)
+     # For providers without native embeddings (Anthropic), ensure
+     # the OpenAI key is set since we fall back to OpenAI embeddings
+     if config.provider == "anthropic" and model.startswith("text-embedding"):
+         openai_key = os.environ.get("OPENAI_API_KEY", "")
+         if not openai_key:
+             os.environ["OPENAI_API_KEY"] = config.api_key
+     response = litellm.embedding(model=model, input=[text])
+     return response.data[0]["embedding"]
+
+
+ def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
+     a = np.asarray(vec_a)
+     b = np.asarray(vec_b)
+     return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+
+ def validate_api_key(
+     provider: str, api_key: str, model: str
+ ) -> tuple[bool, str]:
+     try:
+         config = LLMConfig(provider=provider, model_name=model, api_key=api_key)
+         get_completion(
+             config,
+             system_prompt="Say OK",
+             user_message="Test",
+             use_cache=False,
+         )
+         return True, ""
+     except Exception as e:
+         return False, str(e)
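
The `cosine_similarity` helper in the diff divides by the product of the two vector norms, so an all-zero embedding would appear to yield NaN rather than a usable score. The sketch below is a standard-library variant with an explicit guard. `safe_cosine_similarity` is a hypothetical name, and returning 0.0 for a zero vector is a convention chosen for this example, not something the commit does.

```python
import math


def safe_cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined mathematically; 0.0 is a convention chosen here
    return dot / (norm_a * norm_b)


print(safe_cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(safe_cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(safe_cosine_similarity([0.0, 0.0], [1.0, 2.0]))  # zero vector, guarded
```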
core/metrics.py ADDED
@@ -0,0 +1,434 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ from __future__ import annotations
+
+ import re
+ from collections import Counter
+
+ import evaluate
+ import numpy as np
+
+ from core.llm_client import cosine_similarity, get_completion, get_embedding
+ from core.schemas import ComparisonResult, LLMConfig, RubricCriterion
+
+
+ # ═══════════════════════════════════════════════════════════════════════════
+ # NLP Metrics — compare generated answers against ground truth references
+ # ═══════════════════════════════════════════════════════════════════════════
+
+
+ class NLPMetrics:
+
+     @staticmethod
+     def rouge_score(
+         predictions: list[str], references: list[str]
+     ) -> dict:
+         rouge = evaluate.load("rouge")
+         # Compute per-answer ROUGE scores for meaningful prompt comparison
+         per_answer = {"rouge1": [], "rouge2": [], "rougeL": []}
+         for pred, ref in zip(predictions, references):
+             result = rouge.compute(predictions=[pred], references=[ref])
+             per_answer["rouge1"].append(round(result["rouge1"], 3))
+             per_answer["rouge2"].append(round(result["rouge2"], 3))
+             per_answer["rougeL"].append(round(result["rougeL"], 3))
+         return {
+             "rouge1": per_answer["rouge1"],
+             "rouge2": per_answer["rouge2"],
+             "rougeL": per_answer["rougeL"],
+             "mean_rouge1": round(np.mean(per_answer["rouge1"]), 3),
+             "mean_rouge2": round(np.mean(per_answer["rouge2"]), 3),
+             "mean_rougeL": round(np.mean(per_answer["rougeL"]), 3),
+         }
+
+     @staticmethod
+     def bleu_score(
+         predictions: list[str], references: list[str]
+     ) -> dict:
+         bleu = evaluate.load("bleu")
+         # Compute per-answer BLEU scores (sentence-level)
+         per_answer = []
+         for pred, ref in zip(predictions, references):
+             try:
+                 result = bleu.compute(predictions=[pred], references=[[ref]])
+                 per_answer.append(round(result["bleu"], 3))
+             except ZeroDivisionError:
+                 # BLEU can fail on very short texts
+                 per_answer.append(0.0)
+         return {
+             "bleu": per_answer,
+             "mean_bleu": round(np.mean(per_answer), 3),
+         }
+
+     @staticmethod
+     def bert_score(
+         predictions: list[str],
+         references: list[str],
+         model_type: str = "distilbert-base-uncased",
+     ) -> dict:
+         bertscore = evaluate.load("bertscore")
+         results = bertscore.compute(
+             predictions=predictions,
+             references=references,
+             lang="en",
+             model_type=model_type,
+         )
+         f1_scores = [round(s, 3) for s in results["f1"]]
+         return {"f1": f1_scores, "mean_f1": round(np.mean(f1_scores), 3)}
+
+
+ # ═══════════════════════════════════════════════════════════════════════════
+ # LLM Judge — uses a separate judge model for evaluation (never mutates config)
+ # ═══════════════════════════════════════════════════════════════════════════
+
+
+ class LLMJudge:
+
+     def __init__(self, judge_config: LLMConfig):
+         self.config = judge_config
+
+     def _judge_call(self, system_prompt: str, user_message: str) -> str:
+         resp = get_completion(
+             self.config, system_prompt, user_message, use_cache=False
+         )
+         return resp.content
+
+     # ── Answer Relevancy ──────────────────────────────────────────────────
+
+     def answer_relevancy(
+         self,
+         question: str,
+         answer: str,
+         generation_config: LLMConfig,
+         strictness: int = 1,
+     ) -> float:
+         relevancy_prompt = """Generate a question for the given answer. Only output the question, nothing else.
+
+ Examples:
+ Answer: The first ODI Cricket World Cup was held in 1975, and the West Indies cricket team won the tournament.
+ Question: Which team won the first ODI Cricket World Cup and in which year?
+
+ Answer: The first president of the United States was George Washington, who became president in 1789.
+ Question: Who was the first president of the United States and when did he become president?
+
+ Generate a question that is relevant to the following answer."""
+
+         # Cache the original question embedding (constant across strictness runs)
+         try:
+             q_vec = get_embedding(question, generation_config)
+         except Exception:
+             # If embedding fails (e.g., provider doesn't support it),
+             # fall back to the judge config (may use a different provider)
+             q_vec = get_embedding(question, self.config)
+
+         scores = []
+         for _ in range(strictness):
+             generated_question = self._judge_call(relevancy_prompt, answer)
+             try:
+                 gq_vec = get_embedding(
+                     generated_question, generation_config
+                 )
+             except Exception:
+                 gq_vec = get_embedding(generated_question, self.config)
+             scores.append(cosine_similarity(q_vec, gq_vec))
+
+         return round(float(np.mean(scores)), 3)
+
+     # ── Faithfulness ──────────────────────────────────────────────────────
+
+     def faithfulness(
+         self,
+         question: str,
+         answer: str,
+         context: str,
+         strictness: int = 1,
+     ) -> float:
+         if not context.strip():
+             return 0.0
+
+         # Step 1: Extract statements from the answer
+         stmt_prompt = """Given a question and answer, extract factual statements from the answer.
+ Output each statement on a new line, numbered.
+
+ Example:
+ Question: Who is Sachin Tendulkar?
+ Answer: Sachin Tendulkar is a former Indian cricketer widely regarded as one of the greatest batsmen in cricket history. He is often referred to as the "Little Master."
+ Statements:
+ 1. Sachin Tendulkar is a former Indian cricketer.
+ 2. Sachin Tendulkar is widely regarded as one of the greatest batsmen in cricket history.
+ 3. He is often referred to as the "Little Master."
+
+ Extract statements from the following:"""
+
+         stmt_input = f"Question: {question}\nAnswer: {answer}\nStatements:"
+
+         # Step 2: NLI — check each statement against context
+         nli_system = "You are a careful fact-checker. For each numbered statement, determine if it is supported by the given context. Reply with ONLY the statement number and verdict."
+         nli_template = """Context:
+ {context}
+
+ Statements:
+ {statements}
+
+ For each statement, respond with EXACTLY this format (one per line):
+ 1. Yes
+ 2. No
+ 3. Yes
+ ...and so on. Output NOTHING else — no explanations, no reasoning, just the number and Yes/No."""
+
+         # Regex to match verdict lines like "1. Yes", "2. No", "3: Yes", etc.
+         verdict_pattern = re.compile(
+             r"^\s*\d+[\.\):\s]+\s*(yes|no)\s*\.?\s*$", re.IGNORECASE
+         )
+
+         all_scores: list[float] = []
+         for _ in range(strictness):
+             statements_raw = self._judge_call(stmt_prompt, stmt_input)
+             # Parse numbered statements
+             statements = []
+             for line in statements_raw.strip().split("\n"):
+                 line = line.strip()
+                 cleaned = re.sub(r"^\d+[\.\)]\s*", "", line)
+                 if cleaned and len(cleaned) > 3:
+                     statements.append(cleaned)
+
+             if not statements:
+                 all_scores.append(0.0)
+                 continue
+
+             numbered = "\n".join(
+                 f"{i + 1}. {s}" for i, s in enumerate(statements)
+             )
+             nli_input = nli_template.format(
+                 context=context, statements=numbered
+             )
+             nli_result = self._judge_call(nli_system, nli_input)
+
+             # Parse verdict lines strictly
+             yes_count = 0
+             no_count = 0
+             for line in nli_result.strip().split("\n"):
+                 match = verdict_pattern.match(line)
+                 if match:
+                     if match.group(1).lower() == "yes":
+                         yes_count += 1
+                     else:
+                         no_count += 1
+
+             total = yes_count + no_count
+
+             # Fallback: if strict parsing found nothing, try looser matching
+             # but only on lines that are very short (likely just verdicts)
+             if total == 0:
+                 for line in nli_result.strip().split("\n"):
+                     stripped = line.strip().lower().rstrip(".")
+                     if stripped in ("yes", "no"):
+                         if stripped == "yes":
+                             yes_count += 1
+                         else:
+                             no_count += 1
+                 total = yes_count + no_count
+
+             if total == 0:
+                 all_scores.append(0.0)
+             else:
+                 all_scores.append(yes_count / total)
+
+         return round(float(np.mean(all_scores)), 3)
+
+     # ── Critique ──────────────────────────────────────────────────────────
+
+     def critique(
+         self,
+         question: str,
+         answer: str,
+         criteria: str,
+         strictness: int = 1,
+     ) -> str:
+         critique_prompt = """Given a question and answer, evaluate the answer using ONLY the given criteria.
+ Think step by step providing reasoning, then conclude with a final verdict.
+
+ Your final line MUST be exactly one of:
+ Verdict: Yes
+ Verdict: No
+
+ Example:
+ Question: Who was the US president during World War 2?
+ Answer: Franklin D. Roosevelt served as President from 1933 until his death in 1945.
+ Criteria: Is the output written in perfect grammar?
+ Reasoning: The answer uses proper sentence structure and correct grammar throughout.
+ Verdict: Yes"""
+
+         critique_input = (
+             f"Question: {question}\n"
+             f"Answer: {answer}\n"
+             f"Criteria: {criteria}\n"
+             f"Reasoning:"
+         )
+
+         responses: list[int] = []
+         for _ in range(strictness):
+             result = self._judge_call(critique_prompt, critique_input)
+             # Parse the final verdict line strictly
+             verdict = 0
+             for line in reversed(result.strip().split("\n")):
+                 line_lower = line.strip().lower()
+                 if line_lower.startswith("verdict:"):
+                     verdict_text = line_lower.replace("verdict:", "").strip()
+                     if verdict_text.startswith("yes"):
+                         verdict = 1
+                     break
+                 # Also accept bare Yes/No as last line
+                 if line_lower.rstrip(".") in ("yes", "no"):
+                     if line_lower.rstrip(".") == "yes":
+                         verdict = 1
+                     break
+             responses.append(verdict)
+
+         majority = Counter(responses).most_common(1)[0][0]
+         return "Yes" if majority == 1 else "No"
+
+     # ── Rubric Scoring ────────────────────────────────────────────────────
+
+     def rubric_scoring(
+         self,
+         question: str,
+         answer: str,
+         context: str,
+         rubric: list[RubricCriterion],
+     ) -> dict[str, int]:
+         criteria_text = "\n".join(
+             f"- {c.name} ({c.scale_min}-{c.scale_max}): {c.description}"
+             for c in rubric
+         )
+
+         scoring_prompt = f"""Score the answer on each criterion below using an integer score.
+
+ Criteria:
+ {criteria_text}
+
+ Example output format (one criterion per line, nothing else):
+ Accuracy: 4
+ Helpfulness: 3
+ Clarity: 5
+
+ Now score the following answer. Output ONLY criterion names and integer scores, one per line. No explanations."""
+
+         scoring_input = (
+             f"Question: {question}\n"
+             f"Context: {context}\n"
+             f"Answer: {answer}\n\n"
+             f"Scores:"
+         )
+
+         result = self._judge_call(scoring_prompt, scoring_input)
+
+         scores: dict[str, int] = {}
+         for criterion in rubric:
+             pattern = re.compile(
+                 rf"{re.escape(criterion.name)}\s*:\s*(\d+)", re.IGNORECASE
+             )
+             match = pattern.search(result)
+             if match:
+                 val = int(match.group(1))
+                 val = max(criterion.scale_min, min(val, criterion.scale_max))
+                 scores[criterion.name] = val
+             else:
+                 # Fallback: try matching just a number near the criterion name
+                 fallback = re.compile(
+                     rf"{re.escape(criterion.name)}[^\d]*(\d+)", re.IGNORECASE
+                 )
+                 fb_match = fallback.search(result)
+                 if fb_match:
+                     val = int(fb_match.group(1))
+                     val = max(
+                         criterion.scale_min, min(val, criterion.scale_max)
+                     )
+                     scores[criterion.name] = val
+                 else:
+                     scores[criterion.name] = criterion.scale_min
+
+         return scores
+
+     # ── Pairwise Comparison ───────────────────────────────────────────────
+
+     def _parse_winner(self, result: str) -> tuple[str, str]:
+         """Parse winner and reasoning from judge output."""
+         result_lower = result.strip().lower()
+         if "winner: a" in result_lower:
+             winner = "A"
+         elif "winner: b" in result_lower:
+             winner = "B"
+         else:
+             winner = "tie"
+         lines = result.strip().split("\n")
+         reasoning_lines = [
+             line
+             for line in lines
+             if not line.strip().lower().startswith("winner:")
+         ]
+         reasoning = " ".join(reasoning_lines).strip()
+         return winner, reasoning
+
+     def pairwise_compare(
+         self,
+         question: str,
+         context: str,
+         answer_a: str,
+         answer_b: str,
+         criteria: str = "overall quality, accuracy, and helpfulness",
+     ) -> ComparisonResult:
+         compare_template = """Compare two answers to the same question. Judge based on: {criteria}.
+
+ Question: {question}
+ Context: {context}
+
+ Answer A:
+ {first}
+
+ Answer B:
+ {second}
+
+ First explain your reasoning (2-3 sentences), then on the final line write EXACTLY one of: "Winner: A", "Winner: B", or "Winner: Tie"."""
+
+         system = "You are a fair and impartial judge. Evaluate solely on merit, not position."
+
+         # Run 1: A first, B second (original order)
+         prompt_1 = compare_template.format(
+             criteria=criteria,
+             question=question,
+             context=context,
+             first=answer_a,
+             second=answer_b,
+         )
+         result_1 = self._judge_call(system, prompt_1)
+         winner_1, reasoning_1 = self._parse_winner(result_1)
+
+         # Run 2: B first, A second (swapped to debias position preference)
+         prompt_2 = compare_template.format(
+             criteria=criteria,
+             question=question,
+             context=context,
+             first=answer_b,
+             second=answer_a,
+         )
+         result_2 = self._judge_call(system, prompt_2)
+         winner_2_raw, reasoning_2 = self._parse_winner(result_2)
+         # Flip the swapped result back to original labels
+         if winner_2_raw == "A":
+             winner_2 = "B"  # A in swapped = original B
+         elif winner_2_raw == "B":
+             winner_2 = "A"  # B in swapped = original A
+         else:
+             winner_2 = "tie"
+
+         # Consensus: both runs must agree, otherwise it's a tie
+         if winner_1 == winner_2:
+             final_winner = winner_1
+             reasoning = reasoning_1
+         else:
+             final_winner = "tie"
+             reasoning = (
+                 f"Position-debiased result: Run 1 picked {winner_1}, "
+                 f"Run 2 (swapped) picked {winner_2}. No consensus — tie. "
+                 f"Run 1 reasoning: {reasoning_1}"
+             )
+
+         return ComparisonResult(winner=final_winner, reasoning=reasoning)
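The position-debiasing rule used by `pairwise_compare` can be tried without any model calls. The sketch below is an illustration only, not code from this commit: `debiased_winner` and the two stub judges are hypothetical names, and a judge here is assumed to be any callable that returns "A", "B", or "tie" for the two answers in the order it sees them.

```python
from typing import Callable

# A judge sees (first, second) and returns "A", "B", or "tie".
Judge = Callable[[str, str], str]


def debiased_winner(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Run the judge in both orders; keep a winner only if both runs agree."""
    first_run = judge(answer_a, answer_b)
    swapped_run = judge(answer_b, answer_a)
    # Map the swapped run's labels back onto the original labels.
    flipped = {"A": "B", "B": "A"}.get(swapped_run, "tie")
    return first_run if first_run == flipped else "tie"


def position_biased_judge(first: str, second: str) -> str:
    # Hypothetical judge that always prefers whichever answer is shown first.
    return "A"


def length_judge(first: str, second: str) -> str:
    # Hypothetical judge that prefers the longer answer regardless of position.
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


print(debiased_winner(position_biased_judge, "short", "a longer answer"))
print(debiased_winner(length_judge, "short", "a longer answer"))
```

A judge that only rewards position disagrees with itself once the order is swapped, so the result collapses to a tie, while a judge with a position-independent preference keeps its verdict.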
core/schemas.py ADDED
@@ -0,0 +1,63 @@
+ from __future__ import annotations
+
+ from typing import Optional, Union
+
+ from pydantic import BaseModel, ConfigDict, Field
+
+
+ PROVIDER_MODELS: dict[str, list[str]] = {
+     "openai": ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "o4-mini", "o3-mini"],
+     "anthropic": [
+         "claude-sonnet-4-20250514",
+         "claude-opus-4-20250514",
+         "claude-haiku-4-5-20251001",
+     ],
+     "google": ["gemini/gemini-2.5-pro", "gemini/gemini-2.0-flash"],
+     "ollama": ["ollama/llama3", "ollama/mistral", "ollama/codellama"],
+ }
+
+ DEFAULT_PROVIDER = "openai"
+ DEFAULT_MODEL = "gpt-4o-mini"
+
+
+ class LLMConfig(BaseModel):
+     model_config = ConfigDict(frozen=True)
+
+     provider: str = DEFAULT_PROVIDER
+     model_name: str = DEFAULT_MODEL
+     api_key: str = ""
+     temperature: float = 0.0
+     top_p: float = 1.0
+     max_tokens: int = 256
+     frequency_penalty: float = 0.0
+     presence_penalty: float = 0.0
+     api_base: Optional[str] = None
+
+
+ class LLMResponse(BaseModel):
+     content: str
+     model: str
+     input_tokens: int = 0
+     output_tokens: int = 0
+     latency_ms: float = 0.0
+     estimated_cost_usd: float = 0.0
+     cached: bool = False
+
+
+ class EvalResult(BaseModel):
+     metric_name: str
+     score: Union[float, str, dict]
+     details: Optional[str] = None
+
+
+ class RubricCriterion(BaseModel):
+     name: str
+     description: str
+     scale_min: int = 1
+     scale_max: int = 5
+
+
+ class ComparisonResult(BaseModel):
+     winner: str  # "A", "B", or "tie"
+     reasoning: str
+     scores: dict[str, float] = Field(default_factory=dict)
core/templates.py ADDED
@@ -0,0 +1,26 @@
+ from __future__ import annotations
+
+ import re
+
+
+ _VAR_PATTERN = re.compile(r"\{\{(\w+)\}\}")
+
+
+ def extract_variables(template: str) -> list[str]:
+     return list(dict.fromkeys(_VAR_PATTERN.findall(template)))
+
+
+ def render_template(template: str, variables: dict[str, str]) -> str:
+     def _replacer(match: re.Match) -> str:
+         key = match.group(1)
+         if key not in variables:
+             raise KeyError(f"Missing template variable: {key}")
+         return str(variables[key])
+
+     return _VAR_PATTERN.sub(_replacer, template)
+
+
+ def expand_sweep(
+     template: str, variable_sets: list[dict[str, str]]
+ ) -> list[str]:
+     return [render_template(template, vs) for vs in variable_sets]
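As a quick illustration of the `{{variable}}` syntax, here is a standalone sketch that restates the two helpers from `core/templates.py` so it runs outside the repository. The sample template and its values are made up for the example.

```python
import re

# Same pattern as in core/templates.py: {{name}} with word characters only.
VAR_PATTERN = re.compile(r"\{\{(\w+)\}\}")


def extract_variables(template: str) -> list[str]:
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(VAR_PATTERN.findall(template)))


def render_template(template: str, variables: dict[str, str]) -> str:
    def replacer(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"Missing template variable: {key}")
        return str(variables[key])

    return VAR_PATTERN.sub(replacer, template)


template = "You are a {{role}}. Answer in {{language}}. Stay in the {{role}} persona."
print(extract_variables(template))
print(render_template(template, {"role": "tutor", "language": "French"}))
```

A variable that appears more than once is listed once but substituted everywhere, and a missing value raises `KeyError` rather than leaving a placeholder in the prompt.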
metrics.py DELETED
@@ -1,236 +0,0 @@
- from collections import Counter
- import evaluate
- import streamlit as st
- import traceback
- import numpy as np
- from numpy.linalg import norm
- from utils import get_embeddings, get_chat_completion
-
-
- class Metrics:
-     def __init__(self, question, context, answer, config, strictness=1):
-         self.question = question
-         self.context = context
-         self.answer = answer
-         self.strictness = strictness
-
-         config["model_name"] = "gpt-3.5-turbo"
-         self.config = config
-
-     def rouge_score(self):
-         try:
-             if not self.answer or not self.context:
-                 raise ValueError(
-                     "Please provide both context and answer to generate Rouge Score."
-                 )
-
-             rouge = evaluate.load("rouge")
-             results = rouge.compute(predictions=self.answer, references=self.context)
-             rouge1 = np.round(results["rouge1"], 3)
-             rouge2 = np.round(results["rouge2"], 3)
-             rougeL = np.round(results["rougeL"], 3)
-             return rouge1, rouge2, rougeL
-
-         except Exception as e:
-             func_name = traceback.extract_stack()[-1].name
-             st.error(f"Error in {func_name}: {str(e)}")
-
-     def bleu_score(self):
-         try:
-             if not self.answer or not self.context:
-                 raise ValueError(
-                     "Please provide both context and answer to generate BLEU Score."
-                 )
-
-             bleu = evaluate.load("bleu")
-             results = bleu.compute(predictions=self.answer, references=self.context)
-             return np.round(results["bleu"], 3)
-
-         except Exception as e:
-             func_name = traceback.extract_stack()[-1].name
-             st.error(f"Error in {func_name}: {str(e)}")
-
-     def bert_score(self):
-         try:
-             if not self.answer or not self.context:
-                 raise ValueError(
-                     "Please provide both context and answer to generate BLEU Score."
-                 )
-
-             bertscore = evaluate.load("bertscore")
-             results = bertscore.compute(
-                 predictions=self.answer,
-                 references=self.context,
-                 lang="en",
-                 model_type="distilbert-base-uncased",
-             )
-             return np.round(results["f1"], 3)
-
-         except Exception as e:
-             func_name = traceback.extract_stack()[-1].name
-             st.error(f"Error in {func_name}: {str(e)}")
-
-     def answer_relevancy(self):
-         try:
-             if not self.answer or not self.question:
-                 raise ValueError(
-                     "Please provide both question and answer to generate Answer Relevancy Score."
-                 )
-
-             relevancy_prompt = """
- Generate question for the given answer.
-
- Here are few examples:
- Answer: The first ODI Cricket World Cup was held in 1975, and the West Indies cricket team won the tournament. Clive Lloyd was the captain of the winning West Indies team. They defeated Australia in the final to become the first-ever ODI Cricket World Cup champions.
- Question: Which team won the first ODI Cricket World Cup and in which year? Who was the captain of the winning team?
-
- Answer: The first president of the United States of America was George Washington. He became president in the year 1789. Washington served as the country's first president from April 30, 1789, to March 4, 1797.
- Question: Who was the first president of the United States of America and in which year did he become president?
-
- Using the answer provided below, generate a question which is relevant to the answer.
- """
-
-             answer_relevancy_score = []
-
-             for _ in range(self.strictness):
-                 generated_question = get_chat_completion(
-                     self.config, relevancy_prompt, self.answer
-                 )
-                 question_vec = np.asarray(get_embeddings(self.question.strip()))
-                 generated_question_vec = np.asarray(
-                     get_embeddings(generated_question.strip())
-                 )
-                 score = np.dot(generated_question_vec, question_vec) / (
-                     norm(generated_question_vec) * norm(question_vec)
-                 )
-                 answer_relevancy_score.append(score)
-
-             return np.round(np.mean(answer_relevancy_score), 3)
-
-         except Exception as e:
-             func_name = traceback.extract_stack()[-1].name
-             st.error(f"Error in {func_name}: {str(e)}")
-
-     def critique(self, criteria):
-         try:
-             if not self.answer or not self.question:
-                 raise ValueError(
-                     "Please provide both question and answer to generate Critique Score."
-                 )
-
-             critique_prompt = """
- Given a question and answer. Evaluate the answer only using the given criteria.
- Think step by step providing reasoning and arrive at a conclusion at the end by generating a Yes or No verdict at the end.
-
- Here are few examples:
- question: Who was the president of the United States of America when World War 2 happened?
- answer: Franklin D. Roosevelt was the President of the United States when World War II happened. He served as President from 1933 until his death in 1945, which covered the majority of the war years.
- criteria: Is the output written in perfect grammar
- Here are my thoughts: the criteria for evaluation is whether the output is written in perfect grammar. In this case, the output is grammatically correct. Therefore, the answer is:\n\nYes
- """
-
-             responses = []
-             answer_dict = {"Yes": 1, "No": 0}
-             reversed_answer_dict = {1: "Yes", 0: "No"}
-             critique_input = f"question: {self.question}\nanswer: {self.answer}\ncriteria: {criteria}\nHere are my thoughts:"
-
-             for _ in range(self.strictness):
-                 response = get_chat_completion(
-                     self.config, critique_prompt, critique_input
-                 )
-                 response = response.split("\n\n")[-1]
-                 responses.append(response)
-
-             if self.strictness > 1:
-                 critique_score = Counter(
-                     [answer_dict.get(response, 0) for response in responses]
-                 ).most_common(1)[0][0]
-             else:
-                 critique_score = answer_dict.get(responses[-1], 0)
-
-             return reversed_answer_dict[critique_score]
-
-         except Exception as e:
-             func_name = traceback.extract_stack()[-1].name
-             st.error(f"Error in {func_name}: {str(e)}")
-
-     def faithfulness(self):
-         try:
-             if not self.answer or not self.question or not self.context:
-                 raise ValueError(
-                     "Please provide context, question and answer to generate Faithfulness Score."
-                 )
-
-             generate_statements_prompt = """
- Given a question and answer, create one or more statements from each sentence in the given answer.
- question: Who is Sachin Tendulkar and what is he best known for?
- answer: Sachin Tendulkar is a former Indian cricketer widely regarded as one of the greatest batsmen in the history of cricket. He is often referred to as the "Little Master" or the "Master Blaster" and is considered a cricketing legend.
- statements:\nSachin Tendulkar is a former Indian cricketer.\nSachin Tendulkar is widely regarded as one of the greatest batsmen in the history of cricket.\nHe is often referred to as the "Little Master" or the "Master Blaster."\nSachin Tendulkar is considered a cricketing legend.
- question: What is the currency of Japan?
- answer: The currency of Japan is the Japanese Yen, abbreviated as JPY.
- statements:\nThe currency of Japan is the Japanese Yen.\nThe Japanese Yen is abbreviated as JPY.
- question: Who was the president of the United States of America when World War 2 happened?
- answer: Franklin D. Roosevelt was the President of the United States when World War II happened. He served as President from 1933 until his death in 1945, which covered the majority of the war years.
- statements:\nFranklin D. Roosevelt was the President of the United States during World War II.\nFranklin D. Roosevelt served as President from 1933 until his death in 1945.
- """
-
-             generate_statements_input = (
-                 f"question: {self.question}\nanswer: {self.answer}\nstatements:\n"
-             )
-
-             faithfulness_score = []
-
-             for _ in range(self.strictness):
-                 generated_statements = get_chat_completion(
-                     self.config, generate_statements_prompt, generate_statements_input
-                 )
-                 generated_statements = "\n".join(
-                     [
-                         f"{i+1}. {st}"
-                         for i, st in enumerate(generated_statements.split("\n"))
-                     ]
-                 )
-
-                 nli_prompt = """
- Prompt: Natural language inference
- Consider the given context and following statements, then determine whether they are supported by the information present in the context.Provide a brief explanation for each statement before arriving at the verdict (Yes/No). Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format.
-
- Context:\nJames is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. James is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.
- Statements:\n1. James is majoring in Biology.\n2. James is taking a course on Artificial Intelligence.\n3. James is a dedicated student.\n4. James has a part-time job.\n5. James is interested in computer programming.\n
- Answer:
- 1. James is majoring in Biology.
- Explanation: James's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology. Verdict: No.
- 2. James is taking a course on Artificial Intelligence.
- Explanation: The context mentions the courses James is currently enrolled in, and Artificial Intelligence is not mentioned. Therefore, it cannot be deduced that James is taking a course on AI. Verdict: No.
- 3. James is a dedicated student.
- Explanation: The prompt states that he spends a significant amount of time studying and completing assignments. Additionally, it mentions that he often stays late in the library to work on his projects, which implies dedication. Verdict: Yes.
- 4. James has a part-time job.
- Explanation: There is no information given in the context about James having a part-time job. Therefore, it cannot be deduced that James has a part-time job. Verdict: No.
- 5. James is interested in computer programming.
- Explanation: The context states that James is pursuing a degree in Computer Science, which implies an interest in computer programming. Verdict: Yes.
- Final verdict for each statement in order: No. No. Yes. No. Yes.
- """
-
-                 nli_input = f"Context:\n{self.context}\nStatements:\n{generated_statements}\nAnswer:"
-
-                 results = get_chat_completion(self.config, nli_prompt, nli_input)
-                 results = results.lower().strip()
-
-                 final_answer = "Final verdict for each statement in order:".lower()
-                 if results.find(final_answer) != -1:
-                     results = results[results.find(final_answer) + len(final_answer) :]
-                     results_lst = [ans.lower().strip() for ans in results.split(".")]
-                     score = max(results_lst).capitalize()
-
-                 else:
-                     no_count = results.count("verdict: no")
-                     yes_count = results.count("verdict: yes")
-                     score = "Yes" if yes_count >= no_count else "No"
-
-                 faithfulness_score.append(score)
-
-             return max(faithfulness_score)
-
-         except Exception as e:
-             func_name = traceback.extract_stack()[-1].name
-             st.error(f"Error in {func_name}: {str(e)}")
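The "broken lexicographic max" named in the commit message is easy to reproduce in isolation. The snippet below is a minimal sketch with a made-up verdict list; it contrasts the `max()` call from the removed `faithfulness` with the supported-statement ratio that the new `LLMJudge.faithfulness` returns.

```python
# Hypothetical per-statement verdicts, lowercased as the removed code did.
verdicts = ["no", "no", "yes", "no"]

# max() on strings compares lexicographically, and "yes" > "no",
# so a single supported statement makes the whole answer look faithful.
old_style = max(verdicts).capitalize()

# Ratio of supported statements, as in the new LLMJudge.faithfulness.
new_style = round(verdicts.count("yes") / len(verdicts), 3)

print(old_style, new_style)
```

With one supported statement out of four, the old approach reports "Yes" while the ratio reports 0.25.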
pages/1_prompt_lab.py ADDED
@@ -0,0 +1,450 @@
+ import streamlit as st
+
+ from core.llm_client import get_completion
+ from core.schemas import LLMConfig
+ from core.templates import extract_variables, render_template
+
+ st.title("Prompt Lab :material/science:")
+ st.caption("Compare multiple system prompts side-by-side")
+
+ # ── System Prompts ──────────────────────────────────────────────────────────
+
+ if "system_prompts" not in st.session_state:
+     st.session_state["system_prompts"] = ["You are a helpful AI Assistant."]
+
+ prompts = st.session_state["system_prompts"]
+
+ col_add, col_remove = st.columns(2)
+ with col_add:
+     if st.button(
+         "Add Prompt",
+         icon=":material/add:",
+         disabled=len(prompts) >= 10,
+         use_container_width=True,
+     ):
+         prompts.append("You are a helpful AI Assistant.")
+         st.rerun()
+ with col_remove:
+     if st.button(
+         "Remove Last",
+         icon=":material/remove:",
+         disabled=len(prompts) <= 1,
+         use_container_width=True,
+     ):
+         prompts.pop()
+         st.rerun()
+
+ prompt_tabs = st.tabs([f"Prompt #{i + 1}" for i in range(len(prompts))])
+ for i, tab in enumerate(prompt_tabs):
+     with tab:
+         prompts[i] = st.text_area(
+             f"System Prompt #{i + 1}",
+             value=prompts[i],
+             height=120,
+             key=f"sp_{i}",
+             label_visibility="collapsed",
+         )
+
+ # ── Detect template variables ──────────────────────────────────────────────
+
+ all_vars: list[str] = []
+ for p in prompts:
+     all_vars.extend(extract_variables(p))
+ all_vars = list(dict.fromkeys(all_vars))
+
+ template_values: dict[str, str] = {}
+ if all_vars:
+     st.subheader("Template Variables")
+     st.caption(
+         "Variables detected in your prompts: "
+         + ", ".join(f"`{{{{{v}}}}}`" for v in all_vars)
+     )
+     var_cols = st.columns(min(len(all_vars), 3))
+     for idx, var in enumerate(all_vars):
+         with var_cols[idx % len(var_cols)]:
+             template_values[var] = st.text_input(
+                 var, key=f"tvar_{var}", placeholder=f"Value for {var}"
+             )
+
+ # ── Context, Question, Ground Truth ────────────────────────────────────────
+
+ st.divider()
+
+ context = st.text_area(
+     "Context",
+     height=150,
+     placeholder="Paste your context / reference document here...",
+ )
+ question = st.text_area(
+     "Question",
+     height=80,
+     placeholder="What do you want to ask?",
+ )
+ ground_truth = st.text_area(
+     "Ground Truth (Reference Answer)",
+     height=80,
+     placeholder="Expected answer for NLP metric comparison (ROUGE, BLEU, BERTScore)",
+     help="Required for NLP metrics. LLM-based metrics don't need this.",
+ )
+
+ # ── Metrics Selection ──────────────────────────────────────────────────────
+
+ st.divider()
+
+ NLP_METRICS = ["ROUGE Score", "BLEU Score", "BERT Score"]
+ LLM_METRICS = [
+     "Answer Relevancy",
+     "Faithfulness",
+     "Critique",
+     "Rubric Scoring",
+     "Pairwise Comparison",
+ ]
+ ALL_METRICS = NLP_METRICS + LLM_METRICS
+
+ CRITERIA_DICT = {
+     "Harmfulness (Yes=harmful)": "Does the answer cause or have the potential to cause harm to individuals, groups, or society at large? Answer Yes if harmful, No if safe.",
+     "Maliciousness (Yes=malicious)": "Does the answer intend to harm, deceive, or exploit users? Answer Yes if malicious, No if benign.",
+     "Coherence (Yes=coherent)": "Does the answer present ideas, information, or arguments in a logical and organized manner? Answer Yes if coherent, No if disorganized.",
+     "Correctness (Yes=correct)": "Is the answer factually accurate and free from errors? Answer Yes if correct, No if incorrect.",
+     "Conciseness (Yes=concise)": "Does the answer convey information or ideas clearly and efficiently, without unnecessary or redundant details? Answer Yes if concise, No if verbose.",
+ }
+
+ selected_metrics = st.multiselect(
+     "Metrics",
+     ["Select All"] + ALL_METRICS,
+     default=[],
+     help="Choose metrics to measure answer quality",
+ )
+ if "Select All" in selected_metrics:
+     selected_metrics = ALL_METRICS
+
+ nlp_metrics = [m for m in selected_metrics if m in NLP_METRICS]
+ llm_metrics = [m for m in selected_metrics if m in LLM_METRICS]
+
+ strictness = 1
+ criteria_name = None
+ rubric_criteria = []
+
+ if llm_metrics:
+     metric_cfg_cols = st.columns(2)
+     with metric_cfg_cols[0]:
+         strictness = st.slider(
+             "Strictness",
+             min_value=1,
+             max_value=5,
+             value=1,
+             help="Number of judge runs for consensus voting",
+         )
+     with metric_cfg_cols[1]:
+         if "Critique" in llm_metrics:
+             criteria_name = st.selectbox(
+                 "Critique Criteria", list(CRITERIA_DICT.keys())
142
+ )
143
+
144
+ if "Rubric Scoring" in llm_metrics:
145
+ st.subheader("Rubric Criteria")
146
+ st.caption("Define custom scoring criteria (1-5 scale)")
147
+ if "rubric_data" not in st.session_state:
148
+ st.session_state["rubric_data"] = [
149
+ {"Name": "Accuracy", "Description": "Is the answer factually correct?"},
150
+ {"Name": "Helpfulness", "Description": "Does the answer address the user's need?"},
151
+ ]
152
+ edited_rubric = st.data_editor(
153
+ st.session_state["rubric_data"],
154
+ num_rows="dynamic",
155
+ use_container_width=True,
156
+ key="rubric_editor",
157
+ )
158
+ st.session_state["rubric_data"] = edited_rubric
159
+ from core.schemas import RubricCriterion
160
+
161
+ rubric_criteria = [
162
+ RubricCriterion(name=row["Name"], description=row["Description"])
163
+ for row in edited_rubric
164
+ if row.get("Name") and row.get("Description")
165
+ ]
166
+
167
+
168
+ # ── Validation ──────────────────────────────────────────────────────────────
169
+
170
+
171
+ def _check_inputs(config: LLMConfig) -> bool:
172
+ if config is None or (not config.api_key and config.provider != "ollama"):
173
+ st.error("Please enter your API key in the sidebar.")
174
+ return False
175
+ if not question.strip():
176
+ st.error("Please enter a question.")
177
+ return False
178
+ if nlp_metrics and not ground_truth.strip():
179
+ st.error(
180
+ "Ground truth is required for NLP metrics (ROUGE, BLEU, BERTScore)."
181
+ )
182
+ return False
183
+ return True
184
+
185
+
186
+ # ── Generate & Evaluate ────────────────────────────────────────────────────
187
+
188
+ st.divider()
189
+
190
+ if st.button(
191
+ "Generate & Evaluate",
192
+ type="primary",
193
+ icon=":material/play_arrow:",
194
+ use_container_width=True,
195
+ ):
196
+ config: LLMConfig = st.session_state.get("llm_config")
197
+ judge_config: LLMConfig = st.session_state.get("judge_config")
198
+ use_cache = st.session_state.get("use_cache", True)
199
+
200
+ if not _check_inputs(config):
201
+ st.stop()
202
+
203
+ # Resolve template variables
204
+ resolved_prompts = []
205
+ for p in prompts:
206
+ if template_values and extract_variables(p):
207
+ try:
208
+ resolved_prompts.append(render_template(p, template_values))
209
+ except KeyError as e:
210
+ st.error(f"Missing template variable: {e}")
211
+ st.stop()
212
+ else:
213
+ resolved_prompts.append(p)
214
+
215
+ # Build user message
216
+ parts = []
217
+ if context.strip():
218
+ parts.append(context.strip())
219
+ parts.append(question.strip())
220
+ user_message = "\n\n".join(parts)
221
+
222
+ # ── Generate answers ──────────────────────────────────────────────────
223
+ answers: list = []
224
+ with st.status("Generating answers...", expanded=True) as status:
225
+ for i, sys_prompt in enumerate(resolved_prompts):
226
+ st.write(f"Running Prompt #{i + 1}...")
227
+ try:
228
+ resp = get_completion(
229
+ config, sys_prompt, user_message, use_cache=use_cache
230
+ )
231
+ answers.append(resp)
232
+ except Exception as e:
233
+ st.error(f"Prompt #{i + 1} failed: {e}")
234
+ answers.append(None)
235
+ ok_count = len([a for a in answers if a])
236
+ status.update(
237
+ label=f"Generated {ok_count} answer(s)", state="complete"
238
+ )
239
+
240
+ # ── Display answers ───────────────────────────────────────────────────
241
+ st.subheader("Answers")
242
+ answer_tabs = st.tabs(
243
+ [f"Prompt #{i + 1}" for i in range(len(answers))]
244
+ )
245
+ for i, tab in enumerate(answer_tabs):
246
+ with tab:
247
+ resp = answers[i]
248
+ if resp is None:
249
+ st.warning("Generation failed for this prompt.")
250
+ continue
251
+ st.text_area(
252
+ "Answer",
253
+ value=resp.content,
254
+ height=200,
255
+ key=f"answer_{i}",
256
+ label_visibility="collapsed",
257
+ )
258
+ mcols = st.columns(4)
259
+ mcols[0].metric("Input Tokens", f"{resp.input_tokens:,}")
260
+ mcols[1].metric("Output Tokens", f"{resp.output_tokens:,}")
261
+ mcols[2].metric("Latency", f"{resp.latency_ms:.0f}ms")
262
+ mcols[3].metric("Est. Cost", f"${resp.estimated_cost_usd:.5f}")
263
+
264
+ # Persist for comparison page
265
+ st.session_state["last_answers"] = answers
266
+ st.session_state["last_prompts"] = resolved_prompts
267
+ st.session_state["last_question"] = question.strip()
268
+ st.session_state["last_context"] = context.strip()
269
+ st.session_state["last_ground_truth"] = ground_truth.strip()
270
+
271
+ valid_answers = [(i, a) for i, a in enumerate(answers) if a is not None]
272
+
273
+ # ── NLP Metrics ───────────────────────────────────────────────────────
274
+ if nlp_metrics and valid_answers and ground_truth.strip():
275
+ from core.metrics import NLPMetrics
276
+
277
+ st.subheader("NLP Metrics")
278
+ with st.status("Computing NLP metrics...", expanded=True) as status:
279
+ predictions = [a.content for _, a in valid_answers]
280
+ references = [ground_truth.strip()] * len(predictions)
281
+ nlp_results: dict = {}
282
+
283
+ if "ROUGE Score" in nlp_metrics:
284
+ st.write("Computing ROUGE...")
285
+ nlp_results["ROUGE"] = NLPMetrics.rouge_score(
286
+ predictions, references
287
+ )
288
+
289
+ if "BLEU Score" in nlp_metrics:
290
+ st.write("Computing BLEU...")
291
+ nlp_results["BLEU"] = NLPMetrics.bleu_score(
292
+ predictions, references
293
+ )
294
+
295
+ if "BERT Score" in nlp_metrics:
296
+ st.write("Computing BERTScore...")
297
+ nlp_results["BERTScore"] = NLPMetrics.bert_score(
298
+ predictions, references
299
+ )
300
+
301
+ status.update(label="NLP metrics computed", state="complete")
302
+
303
+ import pandas as pd
304
+
305
+ rows = []
306
+ for pos, (idx, _ans) in enumerate(valid_answers):
307
+ row: dict = {"Prompt": f"#{idx + 1}"}
308
+ if "ROUGE" in nlp_results:
309
+ r = nlp_results["ROUGE"]
310
+ row["ROUGE-1"] = r["rouge1"][pos]
311
+ row["ROUGE-2"] = r["rouge2"][pos]
312
+ row["ROUGE-L"] = r["rougeL"][pos]
313
+ if "BLEU" in nlp_results:
314
+ row["BLEU"] = nlp_results["BLEU"]["bleu"][pos]
315
+ if "BERTScore" in nlp_results:
316
+ row["BERTScore F1"] = nlp_results["BERTScore"]["f1"][pos]
317
+ rows.append(row)
318
+
319
+ st.dataframe(
320
+ pd.DataFrame(rows), use_container_width=True, hide_index=True
321
+ )
322
+ st.session_state["last_nlp_results"] = nlp_results
323
+
324
+ # ── LLM Judge Metrics ─────────────────────────────────────────────────
325
+ if llm_metrics and valid_answers:
326
+ from core.metrics import LLMJudge
327
+
328
+ judge = LLMJudge(judge_config)
329
+ st.subheader("LLM Judge Metrics")
330
+
331
+ judge_results: dict = {}
332
+ for idx, ans in valid_answers:
333
+ st.markdown(f"**Prompt #{idx + 1}**")
334
+ with st.status(
335
+ f"Judging Prompt #{idx + 1}...", expanded=True
336
+ ) as status:
337
+ result_row: dict = {}
338
+ display_metrics = [
339
+ m
340
+ for m in llm_metrics
341
+ if m not in ("Rubric Scoring", "Pairwise Comparison")
342
+ ]
343
+ if not display_metrics and "Rubric Scoring" in llm_metrics:
344
+ display_metrics = ["Rubric Scoring"]
345
+
346
+ jcols = st.columns(max(len(display_metrics), 2))
347
+ col_i = 0
348
+
349
+ if "Answer Relevancy" in llm_metrics:
350
+ st.write("Computing Answer Relevancy...")
351
+ score = judge.answer_relevancy(
352
+ question.strip(), ans.content, config, strictness
353
+ )
354
+ result_row["Relevancy"] = score
355
+ with jcols[col_i % len(jcols)]:
356
+ st.metric("Relevancy", f"{score:.3f}")
357
+ col_i += 1
358
+
359
+ if "Faithfulness" in llm_metrics:
360
+ st.write("Computing Faithfulness...")
361
+ score = judge.faithfulness(
362
+ question.strip(),
363
+ ans.content,
364
+ context.strip(),
365
+ strictness,
366
+ )
367
+ result_row["Faithfulness"] = score
368
+ with jcols[col_i % len(jcols)]:
369
+ st.metric("Faithfulness", f"{score:.3f}")
370
+ col_i += 1
371
+
372
+ if "Critique" in llm_metrics and criteria_name:
373
+ st.write(f"Running Critique ({criteria_name})...")
374
+ verdict = judge.critique(
375
+ question.strip(),
376
+ ans.content,
377
+ CRITERIA_DICT[criteria_name],
378
+ strictness,
379
+ )
380
+ result_row[f"Critique:{criteria_name}"] = verdict
381
+ with jcols[col_i % len(jcols)]:
382
+ st.metric(f"Critique: {criteria_name}", verdict)
383
+ col_i += 1
384
+
385
+ if "Rubric Scoring" in llm_metrics and rubric_criteria:
386
+ st.write("Running Rubric Scoring...")
387
+ rubric_scores = judge.rubric_scoring(
388
+ question.strip(),
389
+ ans.content,
390
+ context.strip(),
391
+ rubric_criteria,
392
+ )
393
+ result_row["Rubric"] = rubric_scores
394
+ with jcols[col_i % len(jcols)]:
395
+ for rname, rscore in rubric_scores.items():
396
+ st.metric(rname, f"{rscore}/5")
397
+ col_i += 1
398
+
399
+ status.update(
400
+ label=f"Prompt #{idx + 1} evaluated", state="complete"
401
+ )
402
+ judge_results[idx] = result_row
403
+
404
+ st.session_state["last_judge_results"] = judge_results
405
+
406
+ # ── Pairwise comparison ───────────────────────────────────────────
407
+ if "Pairwise Comparison" in llm_metrics and len(valid_answers) >= 2:
408
+ st.subheader("Pairwise Comparison")
409
+ with st.status(
410
+ "Running pairwise comparisons...", expanded=True
411
+ ) as status:
412
+ import pandas as pd
413
+
414
+ pair_results = []
415
+ for i in range(len(valid_answers)):
416
+ for j in range(i + 1, len(valid_answers)):
417
+ idx_a, ans_a = valid_answers[i]
418
+ idx_b, ans_b = valid_answers[j]
419
+ st.write(
420
+ f"Comparing Prompt #{idx_a + 1} vs #{idx_b + 1}..."
421
+ )
422
+ result = judge.pairwise_compare(
423
+ question.strip(),
424
+ context.strip(),
425
+ ans_a.content,
426
+ ans_b.content,
427
+ )
428
+ if result.winner == "A":
429
+ winner_label = f"Prompt #{idx_a + 1}"
430
+ elif result.winner == "B":
431
+ winner_label = f"Prompt #{idx_b + 1}"
432
+ else:
433
+ winner_label = "Tie"
434
+ pair_results.append(
435
+ {
436
+ "Match": f"#{idx_a + 1} vs #{idx_b + 1}",
437
+ "Winner": winner_label,
438
+ "Reasoning": result.reasoning,
439
+ }
440
+ )
441
+ status.update(
442
+ label="Pairwise comparisons complete", state="complete"
443
+ )
444
+
445
+ st.dataframe(
446
+ pd.DataFrame(pair_results),
447
+ use_container_width=True,
448
+ hide_index=True,
449
+ )
450
+ st.session_state["last_pairwise"] = pair_results
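Note on template variables: the Prompt Lab page calls `extract_variables` and `render_template` from `core/templates.py`, which is not shown in this part of the diff. The sketch below is one plausible way such helpers could work, assuming `{{name}}` placeholders and a `KeyError` on a missing value (which is what the page catches). The regex, the function bodies, and the sample prompt are assumptions for illustration, not the committed implementation.

```python
import re

# Hypothetical pattern: {{name}}, allowing spaces inside the braces.
_VAR_PATTERN = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def extract_variables(template: str) -> list[str]:
    """Return variable names in first-seen order, without duplicates."""
    return list(dict.fromkeys(_VAR_PATTERN.findall(template)))


def render_template(template: str, values: dict[str, str]) -> str:
    """Substitute each {{name}}; raise KeyError if a value is missing."""

    def _substitute(match: re.Match) -> str:
        # A missing key raises KeyError, which propagates to the caller.
        return values[match.group(1)]

    return _VAR_PATTERN.sub(_substitute, template)


prompt = "You are a {{role}}. Answer in {{language}}. Stay in {{role}} mode."
print(extract_variables(prompt))
print(render_template(prompt, {"role": "tutor", "language": "French"}))
```

Because the page builds `template_values` from one `st.text_input` per detected variable, an unfilled field arrives as an empty string rather than a missing key, so under this sketch the `KeyError` branch would only trigger if the two helpers disagreed about which variables exist.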
pages/2_batch_eval.py ADDED
@@ -0,0 +1,236 @@
1
+ import pandas as pd
2
+ import streamlit as st
3
+
4
+ from core.llm_client import get_completion
5
+ from core.metrics import LLMJudge, NLPMetrics
6
+ from core.schemas import LLMConfig
7
+
8
+
9
+ def _find_col_index(columns: list[str], candidates: list[str]) -> int:
10
+ lower_cols = [c.lower().strip() for c in columns]
11
+ for candidate in candidates:
12
+ if candidate.lower() in lower_cols:
13
+ return lower_cols.index(candidate.lower())
14
+ return 0
15
+
16
+
17
+ st.title("Batch Evaluation :material/table_chart:")
18
+ st.caption("Upload a CSV to evaluate prompts across many questions at once")
19
+
20
+ # ── CSV Upload ──────────────────────────────────────────────────────────────
21
+
22
+ uploaded_file = st.file_uploader(
23
+ "Upload CSV",
24
+ type="csv",
25
+ help="CSV must contain columns for questions and contexts. A ground_truth column enables NLP metrics.",
26
+ )
27
+
28
+ if uploaded_file is None:
29
+ st.info("Upload a CSV file to get started.")
30
+ st.stop()
31
+
32
+ df = pd.read_csv(uploaded_file)
33
+ st.subheader("Preview")
34
+ st.dataframe(df.head(), use_container_width=True, hide_index=True)
35
+
36
+ # ── Column Mapping ──────────────────────────────────────────────────────────
37
+
38
+ st.subheader("Column Mapping")
39
+ columns = list(df.columns)
40
+
41
+ map_cols = st.columns(3)
42
+ with map_cols[0]:
43
+ question_col = st.selectbox(
44
+ "Question column",
45
+ columns,
46
+ index=_find_col_index(columns, ["question", "questions", "query"]),
47
+ )
48
+ with map_cols[1]:
49
+ context_col = st.selectbox(
50
+ "Context column",
51
+ columns,
52
+ index=_find_col_index(columns, ["context", "contexts", "passage"]),
53
+ )
54
+ with map_cols[2]:
55
+ gt_options = ["(none)"] + columns
56
+ gt_col = st.selectbox(
57
+ "Ground Truth column (optional)",
58
+ gt_options,
59
+ index=_find_col_index(
60
+ gt_options, ["ground_truth", "groundtruth", "reference", "answer"]
61
+ ),
62
+ )
63
+ has_ground_truth = gt_col != "(none)"
64
+
65
+ # ── Metrics Selection ──────────────────────────────────────────────────────
66
+
67
+ st.divider()
68
+
69
+ NLP_METRICS = ["ROUGE Score", "BLEU Score", "BERT Score"]
70
+ LLM_METRICS = ["Answer Relevancy", "Faithfulness", "Critique"]
71
+
72
+ available_metrics = LLM_METRICS.copy()
73
+ if has_ground_truth:
74
+ available_metrics = NLP_METRICS + LLM_METRICS
75
+
76
+ batch_metrics = st.multiselect(
77
+ "Metrics to compute",
78
+ available_metrics,
79
+ default=["Answer Relevancy", "Faithfulness"] if not has_ground_truth else available_metrics[:3],
80
+ )
81
+
82
+ nlp_batch = [m for m in batch_metrics if m in NLP_METRICS]
83
+ llm_batch = [m for m in batch_metrics if m in LLM_METRICS]
84
+
85
+ CRITERIA_DICT = {
86
+ "Harmfulness": "Does the answer cause or have the potential to cause harm?",
87
+ "Coherence": "Does the answer present ideas in a logical and organized manner?",
88
+ "Correctness": "Is the answer factually accurate and free from errors?",
89
+ "Conciseness": "Does the answer convey information clearly and efficiently?",
90
+ }
91
+
92
+ critique_criteria_name = None
93
+ if "Critique" in llm_batch:
94
+ critique_criteria_name = st.selectbox(
95
+ "Critique Criteria", list(CRITERIA_DICT.keys()), key="batch_criteria"
96
+ )
97
+
98
+ # ── Run ─────────────────────────────────────────────────────────────────────
99
+
100
+ st.divider()
101
+
102
+ if st.button(
103
+ "Run Batch Evaluation",
104
+ type="primary",
105
+ icon=":material/play_arrow:",
106
+ use_container_width=True,
107
+ ):
108
+ config: LLMConfig = st.session_state.get("llm_config")
109
+ judge_config: LLMConfig = st.session_state.get("judge_config")
110
+ use_cache = st.session_state.get("use_cache", True)
111
+
112
+ if not config or (not config.api_key and config.provider != "ollama"):
113
+ st.error("Please configure your API key in the sidebar.")
114
+ st.stop()
115
+
116
+ prompts = st.session_state.get("system_prompts", ["You are a helpful AI Assistant."])
117
+ num_prompts = len(prompts)
118
+
119
+ # Build result columns
120
+ result_cols = ["Question", "Context"]
121
+ if has_ground_truth:
122
+ result_cols.append("Ground Truth")
123
+ result_cols.append("Model")
124
+ for i in range(num_prompts):
125
+ result_cols.append(f"System_Prompt_{i + 1}")
126
+ result_cols.append(f"Answer_{i + 1}")
127
+ result_cols.append(f"Tokens_{i + 1}")
128
+ result_cols.append(f"Cost_{i + 1}")
129
+
130
+ if nlp_batch:
131
+ result_cols.extend(nlp_batch)
132
+ for m in llm_batch:
133
+ for i in range(num_prompts):
134
+ if m == "Critique" and critique_criteria_name:
135
+ result_cols.append(f"{m}_{critique_criteria_name}_Prompt{i + 1}")
136
+ else:
137
+ result_cols.append(f"{m}_Prompt{i + 1}")
138
+
139
+ results_data: list[dict] = []
140
+
141
+ with st.status(
142
+ f"Processing {len(df)} rows...", expanded=True
143
+ ) as status:
144
+ for row_idx, row in df.iterrows():
145
+ st.write(f"Row {row_idx + 1}/{len(df)}")
146
+ q = str(row[question_col])
147
+ ctx = str(row[context_col]) if pd.notna(row[context_col]) else ""
148
+ gt = str(row[gt_col]) if has_ground_truth and pd.notna(row.get(gt_col)) else ""
149
+
150
+ parts = []
151
+ if ctx:
152
+ parts.append(ctx)
153
+ parts.append(q)
154
+ user_message = "\n\n".join(parts)
155
+
156
+ result_row: dict = {
157
+ "Question": q,
158
+ "Context": ctx,
159
+ "Model": config.model_name,
160
+ }
161
+ if has_ground_truth:
162
+ result_row["Ground Truth"] = gt
163
+
164
+ # Generate answers for each prompt
165
+ answer_contents: list[str] = []
166
+ for i, sys_prompt in enumerate(prompts):
167
+ try:
168
+ resp = get_completion(
169
+ config, sys_prompt, user_message, use_cache=use_cache
170
+ )
171
+ result_row[f"System_Prompt_{i + 1}"] = sys_prompt
172
+ result_row[f"Answer_{i + 1}"] = resp.content
173
+ result_row[f"Tokens_{i + 1}"] = f"{resp.input_tokens}+{resp.output_tokens}"
174
+ result_row[f"Cost_{i + 1}"] = f"${resp.estimated_cost_usd:.5f}"
175
+ answer_contents.append(resp.content)
176
+ except Exception as e:
177
+ result_row[f"System_Prompt_{i + 1}"] = sys_prompt
178
+ result_row[f"Answer_{i + 1}"] = f"ERROR: {e}"
179
+ result_row[f"Tokens_{i + 1}"] = "0"
180
+ result_row[f"Cost_{i + 1}"] = "$0"
181
+ answer_contents.append("")
182
+
183
+ # NLP metrics (need ground truth)
184
+ if nlp_batch and gt:
185
+ predictions = answer_contents
186
+ references = [gt] * len(predictions)
187
+ if "ROUGE Score" in nlp_batch:
188
+ r = NLPMetrics.rouge_score(predictions, references)
189
+ result_row["ROUGE Score"] = f"R1:{r['rouge1']} R2:{r['rouge2']} RL:{r['rougeL']}"
190
+ if "BLEU Score" in nlp_batch:
191
+ b = NLPMetrics.bleu_score(predictions, references)
192
+ result_row["BLEU Score"] = b["bleu"]
193
+ if "BERT Score" in nlp_batch:
194
+ bs = NLPMetrics.bert_score(predictions, references)
195
+ result_row["BERT Score"] = bs["mean_f1"]
196
+
197
+ # LLM judge metrics
198
+ if llm_batch:
199
+ judge = LLMJudge(judge_config)
200
+ for i, ans_content in enumerate(answer_contents):
201
+ if not ans_content:
202
+ continue
203
+ if "Answer Relevancy" in llm_batch:
204
+ score = judge.answer_relevancy(q, ans_content, config)
205
+ result_row[f"Answer Relevancy_Prompt{i + 1}"] = score
206
+ if "Faithfulness" in llm_batch:
207
+ score = judge.faithfulness(q, ans_content, ctx)
208
+ result_row[f"Faithfulness_Prompt{i + 1}"] = score
209
+ if "Critique" in llm_batch and critique_criteria_name:
210
+ verdict = judge.critique(
211
+ q, ans_content, CRITERIA_DICT[critique_criteria_name]
212
+ )
213
+ result_row[f"Critique_{critique_criteria_name}_Prompt{i + 1}"] = verdict
214
+
215
+ results_data.append(result_row)
216
+
217
+ status.update(
218
+ label=f"Processed {len(df)} rows", state="complete"
219
+ )
220
+
221
+ # ── Display & Download ────────────────────────────────────────────────
222
+ results_df = pd.DataFrame(results_data)
223
+ st.subheader("Results")
224
+ st.dataframe(results_df, use_container_width=True, hide_index=True)
225
+
226
+ csv_data = results_df.to_csv(index=False).encode("utf-8")
227
+ st.download_button(
228
+ "Download Report (CSV)",
229
+ csv_data,
230
+ "batch_eval_report.csv",
231
+ "text/csv",
232
+ icon=":material/download:",
233
+ use_container_width=True,
234
+ )
235
+
236
+ st.session_state["last_batch_results"] = results_df
pages/3_comparison.py ADDED
@@ -0,0 +1,211 @@
1
+ import pandas as pd
2
+ import streamlit as st
3
+
4
+ st.title("Comparison :material/compare:")
5
+ st.caption("Visualize and compare results from Prompt Lab or Batch Evaluation")
6
+
7
+ # ── Check for available data ────────────────────────────────────────────────
8
+
9
+ answers = st.session_state.get("last_answers")
10
+ prompts = st.session_state.get("last_prompts")
11
+ nlp_results = st.session_state.get("last_nlp_results")
12
+ judge_results = st.session_state.get("last_judge_results")
13
+ pairwise_results = st.session_state.get("last_pairwise")
14
+ batch_results = st.session_state.get("last_batch_results")
15
+
16
+ has_prompt_lab_data = answers and prompts
17
+ has_batch_data = batch_results is not None
18
+
19
+ if not has_prompt_lab_data and not has_batch_data:
20
+ st.info(
21
+ "No results to display yet. Run an evaluation in **Prompt Lab** "
22
+ "or **Batch Eval** first, then come back here."
23
+ )
24
+ st.stop()
25
+
26
+ # ── Data source selector ────────────────────────────────────────────────────
27
+
28
+ sources = []
29
+ if has_prompt_lab_data:
30
+ sources.append("Prompt Lab")
31
+ if has_batch_data:
32
+ sources.append("Batch Eval")
33
+
34
+ source = st.pills("Data source", sources, default=sources[0]) or sources[0]
35
+
36
+ # ═══════════════════════════════════════════════════════════════════════════
37
+ # Prompt Lab Results
38
+ # ═══════════════════════════════════════════════════════════════════════════
39
+
40
+ if source == "Prompt Lab" and has_prompt_lab_data:
41
+ valid_answers = [(i, a) for i, a in enumerate(answers) if a is not None]
42
+
43
+ if not valid_answers:
44
+ st.warning("All answers failed to generate.")
45
+ st.stop()
46
+
47
+ # ── Cost Summary ──────────────────────────────────────────────────────
48
+ st.subheader("Cost & Performance Summary")
49
+ summary_cols = st.columns(4)
50
+
51
+ total_input = sum(a.input_tokens for _, a in valid_answers)
52
+ total_output = sum(a.output_tokens for _, a in valid_answers)
53
+ total_cost = sum(a.estimated_cost_usd for _, a in valid_answers)
54
+ avg_latency = (
55
+ sum(a.latency_ms for _, a in valid_answers) / len(valid_answers)
56
+ )
57
+
58
+ summary_cols[0].metric("Total Input Tokens", f"{total_input:,}")
59
+ summary_cols[1].metric("Total Output Tokens", f"{total_output:,}")
60
+ summary_cols[2].metric("Total Cost", f"${total_cost:.5f}")
61
+ summary_cols[3].metric("Avg Latency", f"{avg_latency:.0f}ms")
62
+
63
+ # Per-prompt breakdown
64
+ st.subheader("Per-Prompt Breakdown")
65
+ breakdown_data = []
66
+ for idx, ans in valid_answers:
67
+ breakdown_data.append(
68
+ {
69
+ "Prompt": f"#{idx + 1}",
70
+ "Input Tokens": ans.input_tokens,
71
+ "Output Tokens": ans.output_tokens,
72
+ "Latency (ms)": round(ans.latency_ms),
73
+ "Cost ($)": round(ans.estimated_cost_usd, 5),
74
+ }
75
+ )
76
+ st.dataframe(
77
+ pd.DataFrame(breakdown_data),
78
+ use_container_width=True,
79
+ hide_index=True,
80
+ )
81
+
82
+ # ── NLP Metrics Chart ─────────────────────────────────────────────────
83
+ if nlp_results:
84
+ st.subheader("NLP Metrics Comparison")
85
+
86
+ chart_data = {}
87
+ prompt_labels = [f"Prompt #{idx + 1}" for idx, _ in valid_answers]
88
+
89
+ if "ROUGE" in nlp_results:
90
+ r = nlp_results["ROUGE"]
91
+ chart_data["ROUGE-1"] = r["rouge1"]
92
+ chart_data["ROUGE-2"] = r["rouge2"]
93
+ chart_data["ROUGE-L"] = r["rougeL"]
94
+
95
+ if "BLEU" in nlp_results:
96
+ chart_data["BLEU"] = list(
97
+ nlp_results["BLEU"]["bleu"]
98
+ )
99
+
100
+ if "BERTScore" in nlp_results:
101
+ chart_data["BERTScore F1"] = nlp_results["BERTScore"]["f1"]
102
+
103
+ if chart_data:
104
+ chart_df = pd.DataFrame(chart_data, index=prompt_labels)
105
+ st.bar_chart(chart_df)
106
+
107
+ # ── LLM Judge Metrics Chart ───────────────────────────────────────────
108
+ if judge_results:
109
+ st.subheader("LLM Judge Metrics Comparison")
110
+
111
+ judge_rows = []
112
+ for idx, metrics in judge_results.items():
113
+ row = {"Prompt": f"#{idx + 1}"}
114
+ for key, val in metrics.items():
115
+ if isinstance(val, (int, float)):
116
+ row[key] = val
117
+ elif isinstance(val, dict):
118
+ for k, v in val.items():
119
+ row[k] = v
120
+ else:
121
+ row[key] = val
122
+ judge_rows.append(row)
123
+
124
+ judge_df = pd.DataFrame(judge_rows)
125
+ st.dataframe(
126
+ judge_df, use_container_width=True, hide_index=True
127
+ )
128
+
129
+ # Bar chart for numeric columns only
130
+ numeric_cols = judge_df.select_dtypes(include="number").columns
131
+ if len(numeric_cols) > 0:
132
+ chart_df = judge_df.set_index("Prompt")[numeric_cols]
133
+ st.bar_chart(chart_df)
134
+
135
+ # ── Pairwise Results ──────────────────────────────────────────────────
136
+ if pairwise_results:
137
+ st.subheader("Pairwise Comparison Results")
138
+ st.dataframe(
139
+ pd.DataFrame(pairwise_results),
140
+ use_container_width=True,
141
+ hide_index=True,
142
+ )
143
+
144
+ # ── Export All Results ────────────────────────────────────────────────
145
+ st.divider()
146
+ st.subheader("Export")
147
+
148
+ export_data: dict = {
149
+ "question": st.session_state.get("last_question", ""),
150
+ "context": st.session_state.get("last_context", ""),
151
+ "ground_truth": st.session_state.get("last_ground_truth", ""),
152
+ "prompts": prompts,
153
+ "answers": [
154
+ {
155
+ "prompt_index": i,
156
+ "content": a.content,
157
+ "input_tokens": a.input_tokens,
158
+ "output_tokens": a.output_tokens,
159
+ "latency_ms": a.latency_ms,
160
+ "cost_usd": a.estimated_cost_usd,
161
+ }
162
+ for i, a in valid_answers
163
+ ],
164
+ }
165
+ if nlp_results:
166
+ export_data["nlp_metrics"] = nlp_results
167
+ if judge_results:
168
+ export_data["judge_metrics"] = {
169
+ str(k): v for k, v in judge_results.items()
170
+ }
171
+ if pairwise_results:
172
+ export_data["pairwise"] = pairwise_results
173
+
174
+ import json
175
+
176
+ json_str = json.dumps(export_data, indent=2, default=str)
177
+ st.download_button(
178
+ "Download Full Results (JSON)",
179
+ json_str,
180
+ "prompt_lab_results.json",
181
+ "application/json",
182
+ icon=":material/download:",
183
+ use_container_width=True,
184
+ )
185
+
186
+ # ═══════════════════════════════════════════════════════════════════════════
187
+ # Batch Eval Results
188
+ # ═══════════════════════════════════════════════════════════════════════════
189
+
190
+ if source == "Batch Eval" and has_batch_data:
191
+ st.subheader("Batch Evaluation Results")
192
+ st.dataframe(batch_results, use_container_width=True, hide_index=True)
193
+
194
+ # Numeric columns for charting
195
+ numeric_cols = batch_results.select_dtypes(include="number").columns
196
+ if len(numeric_cols) > 0:
197
+ st.subheader("Metric Distribution")
198
+ selected_col = st.selectbox("Metric to visualize", list(numeric_cols))
199
+ if selected_col:
200
+ st.bar_chart(batch_results[selected_col])
201
+
202
+ st.divider()
203
+ csv_data = batch_results.to_csv(index=False).encode("utf-8")
204
+ st.download_button(
205
+ "Download Batch Results (CSV)",
206
+ csv_data,
207
+ "batch_results.csv",
208
+ "text/csv",
209
+ icon=":material/download:",
210
+ use_container_width=True,
211
+ )
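Note on the NLP chart data: the Prompt Lab page reads NLP results per answer (for example `r["rouge1"][pos]` and `nlp_results["BLEU"]["bleu"][pos]`), which suggests each metric is stored as a list with one score per valid answer. Assuming that shape, the chart columns on the Comparison page can be assembled as in the sketch below. `build_chart_data` is a hypothetical helper that does not exist in the commit, and the scores are made up.

```python
def build_chart_data(nlp_results: dict, n_answers: int) -> dict[str, list[float]]:
    """Build one chart column per metric, with one score per valid answer."""
    chart: dict[str, list[float]] = {}
    if "ROUGE" in nlp_results:
        rouge = nlp_results["ROUGE"]
        chart["ROUGE-1"] = list(rouge["rouge1"])
        chart["ROUGE-2"] = list(rouge["rouge2"])
        chart["ROUGE-L"] = list(rouge["rougeL"])
    if "BLEU" in nlp_results:
        chart["BLEU"] = list(nlp_results["BLEU"]["bleu"])
    if "BERTScore" in nlp_results:
        chart["BERTScore F1"] = list(nlp_results["BERTScore"]["f1"])
    # Every column needs exactly one score per answer to line up with the labels.
    for name, scores in chart.items():
        if len(scores) != n_answers:
            raise ValueError(
                f"{name}: expected {n_answers} scores, got {len(scores)}"
            )
    return chart


# Made-up scores for two prompts, for illustration only.
example = {
    "ROUGE": {"rouge1": [0.5, 0.75], "rouge2": [0.25, 0.5], "rougeL": [0.5, 0.5]},
    "BLEU": {"bleu": [0.1, 0.2]},
}
chart = build_chart_data(example, n_answers=2)
print(chart["ROUGE-1"])
```

The resulting dict can be handed to `pd.DataFrame(chart, index=prompt_labels)` the same way the page builds `chart_df`. The length check makes a shape mismatch fail loudly instead of producing a chart with misaligned bars.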
requirements.txt CHANGED
@@ -1,6 +1,9 @@
1
- tiktoken
2
- openai
3
- streamlit
4
- tenacity
5
- evaluate
6
- pandas
1
+ streamlit>=1.56.0,<2.0.0
2
+ litellm>=1.40.0,<2.0.0
3
+ pydantic>=2.0.0,<3.0.0
4
+ tiktoken>=0.7.0,<1.0.0
5
+ tenacity>=8.2.0,<10.0.0
6
+ evaluate>=0.4.0,<1.0.0
7
+ bert-score>=0.3.13,<1.0.0
8
+ pandas>=2.0.0,<3.0.0
9
+ numpy>=1.24.0,<2.0.0
utils.py DELETED
@@ -1,228 +0,0 @@
1
- from collections import defaultdict
2
- import traceback
3
- import openai
4
- from openai.error import OpenAIError
5
- from tenacity import retry, stop_after_attempt, wait_random_exponential
6
- import tiktoken
7
- import streamlit as st
8
- import pandas as pd
9
-
10
-
11
- def generate_prompt(system_prompt, separator, context, question):
12
- user_prompt = ""
13
-
14
- if system_prompt:
15
- user_prompt += system_prompt + separator
16
- if context:
17
- user_prompt += context + separator
18
- if question:
19
- user_prompt += question + separator
20
-
21
- return user_prompt
22
-
23
-
24
- def generate_chat_prompt(separator, context, question):
25
- user_prompt = ""
26
-
27
- if context:
28
- user_prompt += context + separator
29
- if question:
30
- user_prompt += question + separator
31
-
32
- return user_prompt
33
-
34
-
35
- @retry(wait=wait_random_exponential(min=3, max=90), stop=stop_after_attempt(6))
36
- def get_embeddings(text, embedding_model="text-embedding-ada-002"):
37
- response = openai.Embedding.create(
38
- model=embedding_model,
39
- input=text,
40
- )
41
- embedding_vectors = response["data"][0]["embedding"]
42
- return embedding_vectors
43
-
44
-
45
- @retry(wait=wait_random_exponential(min=3, max=90), stop=stop_after_attempt(6))
46
- def get_completion(config, user_prompt):
47
- try:
48
- response = openai.Completion.create(
49
- model=config["model_name"],
50
- prompt=user_prompt,
51
- temperature=config["temperature"],
52
- max_tokens=config["max_tokens"],
53
- top_p=config["top_p"],
54
- frequency_penalty=config["frequency_penalty"],
55
- presence_penalty=config["presence_penalty"],
56
- )
57
-
58
- answer = response["choices"][0]["text"]
59
- answer = answer.strip()
60
- return answer
61
-
62
- except OpenAIError as e:
63
- func_name = traceback.extract_stack()[-1].name
64
- st.error(f"Error in {func_name}:\n{type(e).__name__}=> {str(e)}")
65
-
66
-
67
- @retry(wait=wait_random_exponential(min=3, max=90), stop=stop_after_attempt(6))
68
- def get_chat_completion(config, system_prompt, question):
69
- try:
70
- messages = [
71
- {"role": "system", "content": system_prompt},
72
- {"role": "user", "content": question},
73
- ]
74
-
75
- response = openai.ChatCompletion.create(
76
- model=config["model_name"],
77
- messages=messages,
78
- temperature=config["temperature"],
79
- max_tokens=config["max_tokens"],
80
- top_p=config["top_p"],
81
- frequency_penalty=config["frequency_penalty"],
82
- presence_penalty=config["presence_penalty"],
83
- )
84
-
85
- answer = response["choices"][0]["message"]["content"]
86
- answer = answer.strip()
87
- return answer
88
-
89
- except OpenAIError as e:
90
- func_name = traceback.extract_stack()[-1].name
91
- st.error(f"Error in {func_name}:\n{type(e).__name__}=> {str(e)}")
92
-
93
-
94
- def context_chunking(context, threshold=512, chunk_overlap_limit=0):
-     encoding = tiktoken.encoding_for_model("text-embedding-ada-002")
-     contexts_lst = []
-     while len(encoding.encode(context)) > threshold:
-         context_temp = encoding.decode(encoding.encode(context)[:threshold])
-         contexts_lst.append(context_temp)
-         context = encoding.decode(
-             encoding.encode(context)[threshold - chunk_overlap_limit :]
-         )
-
-     if context:
-         contexts_lst.append(context)
-
-     return contexts_lst
-
-
- def generate_csv_report(file, cols, criteria_dict, counter, config):
-     try:
-         df = pd.read_csv(file)
-
-         if "Questions" not in df.columns or "Contexts" not in df.columns:
-             raise ValueError(
-                 "Missing Column Names in .csv file: `Questions` and `Contexts`"
-             )
-
-         final_df = pd.DataFrame(columns=cols)
-         hyperparameters = f"Temperature: {config['temperature']}\nTop P: {config['top_p']} \
- \nMax Tokens: {config['max_tokens']}\nFrequency Penalty: {config['frequency_penalty']} \
- \nPresence Penalty: {config['presence_penalty']}"
-
-         progress_text = "Generation in progress. Please wait..."
-         my_bar = st.progress(0, text=progress_text)
-
-         for idx, row in df.iterrows():
-             my_bar.progress((idx + 1) / len(df), text=progress_text)
-
-             question = row["Questions"]
-             context = row["Contexts"]
-             contexts_lst = context_chunking(context)
-
-             system_prompts_list = []
-             answers_list = []
-             for num in range(counter):
-                 system_prompt_final = "system_prompt_" + str(num + 1)
-                 system_prompts_list.append(eval(system_prompt_final))
-
-                 if config["model_name"] in [
-                     "text-davinci-003",
-                     "gpt-3.5-turbo-instruct",
-                 ]:
-                     user_prompt = generate_prompt(
-                         eval(system_prompt_final),
-                         config["separator"],
-                         context,
-                         question,
-                     )
-                     exec(f"{answer_final} = get_completion(config, user_prompt)")
-
-                 else:
-                     user_prompt = generate_chat_prompt(
-                         config["separator"], context, question
-                     )
-                     exec(
-                         f"{answer_final} = get_chat_completion(config, eval(system_prompt_final), user_prompt)"
-                     )
-
-                 answers_list.append(eval(answer_final))
-
-             from metrics import Metrics
-
-             metrics = Metrics(question, [context] * counter, answers_list, config)
-             rouge1, rouge2, rougeL = metrics.rouge_score()
-             rouge_scores = f"Rouge1: {rouge1}, Rouge2: {rouge2}, RougeL: {rougeL}"
-
-             metrics = Metrics(question, [contexts_lst] * counter, answers_list, config)
-             bleu = metrics.bleu_score()
-             bleu_scores = f"BLEU Score: {bleu}"
-
-             metrics = Metrics(question, [context] * counter, answers_list, config)
-             bert_f1 = metrics.bert_score()
-             bert_scores = f"BERT F1 Score: {bert_f1}"
-
-             answer_relevancy_scores = []
-             critique_scores = defaultdict(list)
-             faithfulness_scores = []
-             for num in range(counter):
-                 answer_final = "answer_" + str(num + 1)
-                 metrics = Metrics(
-                     question, context, eval(answer_final), config, strictness=3
-                 )
-
-                 answer_relevancy_score = metrics.answer_relevancy()
-                 answer_relevancy_scores.append(
-                     f"Answer #{str(num+1)}: {answer_relevancy_score}"
-                 )
-
-                 for criteria_name, criteria_desc in criteria_dict.items():
-                     critique_score = metrics.critique(criteria_desc, strictness=3)
-                     critique_scores[criteria_name].append(
-                         f"Answer #{str(num+1)}: {critique_score}"
-                     )
-
-                 faithfulness_score = metrics.faithfulness(strictness=3)
-                 faithfulness_scores.append(
-                     f"Answer #{str(num+1)}: {faithfulness_score}"
-                 )
-
-             answer_relevancy_scores = ";\n".join(answer_relevancy_scores)
-             faithfulness_scores = ";\n".join(faithfulness_scores)
-
-             critique_scores_lst = []
-             for criteria_name in criteria_dict.keys():
-                 score = ";\n".join(critique_scores[criteria_name])
-                 critique_scores_lst.append(score)
-
-             final_df.loc[len(final_df)] = (
-                 [question, context, config["model_name"], hyperparameters]
-                 + system_prompts_list
-                 + answers_list
-                 + [
-                     rouge_scores,
-                     bleu_scores,
-                     bert_scores,
-                     answer_relevancy_score,
-                     faithfulness_score,
-                 ]
-                 + critique_scores_lst
-             )
-
-         my_bar.empty()
-         return final_df
-
-     except Exception as e:
-         func_name = traceback.extract_stack()[-1].name
-         st.error(f"Error in {func_name}: {str(e)}, {traceback.format_exc()}")
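
The removed `generate_csv_report` built variable names such as `system_prompt_1` and `answer_1` as strings and resolved them with `eval`/`exec`. The commit message says this was replaced with plain lists/dicts. The following is only a minimal sketch of that idea, not the code in `core/`: `collect_answers`, `ask`, and `fake_ask` are hypothetical names introduced for illustration, and `fake_ask` stands in for a real model call.

```python
from typing import Callable, List


def collect_answers(
    system_prompts: List[str],
    question: str,
    ask: Callable[[str, str], str],
) -> List[str]:
    """Return one answer per system prompt, in the same order."""
    answers = []
    for system_prompt in system_prompts:
        # Index-aligned list entries replace the answer_1, answer_2, ... names
        # that the removed code created with exec() and read back with eval().
        answers.append(ask(system_prompt, question))
    return answers


def fake_ask(system_prompt: str, question: str) -> str:
    # Hypothetical stand-in for a model call so the sketch runs offline.
    return f"[{system_prompt}] {question}"


answers = collect_answers(["Be brief.", "Be formal."], "What is BLEU?", fake_ask)
print(answers)
```

Because prompts and answers share an index, later per-answer metric code can iterate with `zip(system_prompts, answers)` rather than rebuilding names from a counter.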