aneeb15 committed on
Commit
9ec2fac
·
1 Parent(s): 335cd41

feat: implement dynamic evaluation page (remove static charts)

Files changed (2)
  1. PROJECT_HIGHLIGHTS.md +0 -54
  2. app.py +186 -127
PROJECT_HIGHLIGHTS.md DELETED
@@ -1,54 +0,0 @@
1
- # 🚀 Auto-FineTune-Ops: Project Highlights
2
-
3
- **Autonomous Machine Learning Pipeline for Production-Grade LLM Fine-Tuning**
4
-
5
- Auto-FineTune-Ops is a comprehensive, no-code/low-code platform that democratizes access to state-of-the-art LLM fine-tuning. It automates the complex lifecycle of data preparation, training, evaluation, and deployment.
6
-
7
- ---
8
-
9
- ## 🌟 Key Features
10
-
11
- ### 1. 🧠 Intelligent Preprocessing Engine
12
- A modular, production-ready data pipeline with 10+ specialized modules:
13
- - **Text Cleaning:** Auto-strip HTML, emojis, URLs, and normalize whitespace.
14
- - **PII Redaction:** Detect and mask emails, phone numbers, and keys for security.
15
- - **Deduplication:** Remove exact and semantic duplicates (using TF-IDF/Cosine Similarity).
16
- - **Quality Filtering:** Filter by language, toxicity, and length constraints.
17
- - **Advanced Formatting:** Auto-convert loose CSV/JSON into strict Chat Templates (ShareGPT/OpenAI).
18
-
19
- ### 2. ⚡ Hybrid Training Ecosystem
20
- Flexible training workflows designed for all hardware setups:
21
- - **Local GPU Power:** Leverages **Unsloth** for 2x faster training and 70% less memory usage (4-bit quantization).
22
- - **Google Colab Bridge:** Seamless "No-GPU" fallback flow. Generate a ready-to-run Colab notebook to train on free cloud GPUs if local hardware is insufficient.
23
- - **Custom Model Support:** Fine-tune any HuggingFace model (Llama 3, Mistral, Gemma, Phi-3, etc.).
24
-
25
- ### 3. ⚖️ Multi-Provider AI Judge Arena
26
- Production-grade model evaluation using LLM-as-a-Judge:
27
- - **Provider Agnostic:** Supports OpenAI (GPT-4o), Anthropic (Claude 3.5), Google (Gemini 1.5), and Groq (Llama 3).
28
- - **Custom Endpoints:** Connect to local LLMs (Ollama/vLLM) as judges.
29
- - **Comprehensive Metrics:** Automated scoring for Accuracy, Helpfulness, Clarity, and Tone.
30
- - **Head-to-Head:** Win-rate visualization comparing Base Model vs. Fine-Tuned Model.
31
-
32
- ### 4. 🖥️ Interactive Streamlit Dashboard
33
- A premium, dark-mode UI that abstracts away CLI complexity:
34
- - **Project Management:** Manage datasets, models, and logs visually.
35
- - **Real-time Monitoring:** Track training loss and progress live.
36
- - **Visualization:** Interactive Plotly charts for evaluation results.
37
-
38
- ### 5. 🚀 One-Click Deployment
39
- - **Instant API:** Export trained models as a production-ready **FastAPI** microservice.
40
- - **Standardized Interface:** OpenAI-compatible `/generate` endpoints for easy integration into apps.
41
-
42
- ---
43
-
44
- ## 🔧 Technical Stack
45
- - **Frontend:** Streamlit, Plotly
46
- - **Core ML:** PyTorch, Transformers, PEFT, Unsloth, TRL
47
- - **Data:** Pandas, NumPy, Scikit-learn
48
- - **API:** FastAPI, Uvicorn
49
- - **LLM Clients:** OpenAI SDK, Anthropic SDK
50
-
51
- ## 🛡️ Production Readiness
52
- - **Modular Architecture:** Agent-based design (DataArchitect, TrainingPilot, TheJudge) allows easy extensibility.
53
- - **Error Handling:** Robust fallback mechanisms and detailed logging.
54
- - **Security:** PII masking and API key management best practices.
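The deleted highlights mention PII redaction for emails, phone numbers, and keys, but this diff does not show how it is implemented. The following is only a minimal sketch of what such masking could look like with the standard library. The regular expressions, the placeholder tokens, and the `redact_pii` name are assumptions for illustration, not the project's actual module.

```python
import re

# Hypothetical patterns for illustration; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
KEY_RE = re.compile(r"\bsk-[A-Za-z0-9_-]{16,}\b")  # assumes "sk-" style secret keys


def redact_pii(text: str) -> str:
    """Replace emails, key-like tokens, and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Keys are masked before phone numbers so digit runs inside a key
    # are not mistaken for a phone number.
    text = KEY_RE.sub("[API_KEY]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Masking keys before phone numbers is a deliberate ordering choice in this sketch; a production pipeline would likely also log how many replacements were made per record.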
app.py CHANGED
@@ -1213,178 +1213,237 @@ def render_training():
1213
  def render_evaluation():
1214
  st.markdown('<p class="gradient-header">⚖️ Model Evaluation</p>', unsafe_allow_html=True)
1215
 
1216
  # ── Judge Provider Selection ──
1217
  st.markdown("### πŸ€– Select AI Judge Provider")
1218
- st.caption("Choose which LLM provider to use as the evaluation judge. You can use any model you have API access to.")
1219
 
1220
  judge_provider = st.selectbox("AI Provider", [
1221
  "OpenAI (GPT-4o, GPT-4-turbo, etc.)",
1222
  "Anthropic (Claude 3.5, Claude 3 Opus, etc.)",
1223
- "Google Gemini (Gemini Pro, Gemini 1.5, etc.)",
1224
- "Groq (Llama, Mixtral, Gemma, etc.)",
1225
  "Custom OpenAI-Compatible Endpoint"
1226
- ], help="Select the AI provider whose model will act as the judge for evaluating your fine-tuned model.")
1227
 
1228
  st.markdown("---")
1229
  st.markdown("### 🔑 API Configuration")
1230
 
1231
  if "OpenAI" in judge_provider:
1232
  col1, col2 = st.columns(2)
1233
  with col1:
1234
- openai_key = st.text_input("OpenAI API Key", type="password",
1235
- help="Your OpenAI API key (starts with sk-)")
1236
- if openai_key:
1237
- os.environ["OPENAI_API_KEY"] = openai_key
1238
  with col2:
1239
- judge_model = st.selectbox("Judge Model", [
1240
- "gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-4", "gpt-3.5-turbo"
1241
- ])
1242
 
1243
  elif "Anthropic" in judge_provider:
1244
  col1, col2 = st.columns(2)
1245
  with col1:
1246
- anthropic_key = st.text_input("Anthropic API Key", type="password",
1247
- help="Your Anthropic API key")
1248
- if anthropic_key:
1249
- os.environ["ANTHROPIC_API_KEY"] = anthropic_key
1250
- with col2:
1251
- judge_model = st.selectbox("Judge Model", [
1252
- "claude-3-5-sonnet-20241022", "claude-3-opus-20240229",
1253
- "claude-3-sonnet-20240229", "claude-3-haiku-20240307"
1254
- ])
1255
-
1256
- elif "Gemini" in judge_provider:
1257
- col1, col2 = st.columns(2)
1258
- with col1:
1259
- gemini_key = st.text_input("Google AI API Key", type="password",
1260
- help="Your Google AI Studio API key for Gemini models")
1261
- if gemini_key:
1262
- os.environ["GOOGLE_API_KEY"] = gemini_key
1263
  with col2:
1264
- judge_model = st.selectbox("Judge Model", [
1265
- "gemini-1.5-pro", "gemini-1.5-flash", "gemini-pro"
1266
- ])
1267
 
1268
  elif "Groq" in judge_provider:
1269
  col1, col2 = st.columns(2)
1270
  with col1:
1271
- groq_key = st.text_input("Groq API Key", type="password",
1272
- help="Your Groq API key for fast inference")
1273
- if groq_key:
1274
- os.environ["GROQ_API_KEY"] = groq_key
1275
  with col2:
1276
- judge_model = st.selectbox("Judge Model", [
1277
- "llama-3.1-70b-versatile", "llama-3.1-8b-instant",
1278
- "mixtral-8x7b-32768", "gemma2-9b-it"
1279
- ])
1280
 
1281
- else: # Custom endpoint
1282
  col1, col2 = st.columns(2)
1283
  with col1:
1284
- custom_base_url = st.text_input("API Base URL",
1285
- placeholder="https://api.your-provider.com/v1",
1286
- help="OpenAI-compatible API endpoint (e.g., vLLM, Ollama, LM Studio)")
1287
- custom_api_key = st.text_input("API Key", type="password",
1288
- help="API key for the custom endpoint (use 'none' for local servers)")
1289
- if custom_api_key:
1290
- os.environ["OPENAI_API_KEY"] = custom_api_key
1291
- if custom_base_url:
1292
- os.environ["OPENAI_BASE_URL"] = custom_base_url
1293
  with col2:
1294
- judge_model = st.text_input("Model Name",
1295
- placeholder="e.g., my-model, llama-3-8b",
1296
- help="Model identifier used by your custom endpoint")
1297
 
1298
  st.markdown("---")
1299
 
1300
- # ── Model / Results to Evaluate ──
1301
  st.markdown("### πŸ“Š Evaluation Data")
 
 
 
 
 
 
 
 
1302
 
1303
- if st.session_state.model_path:
1304
- st.info(f"📦 Model / Results: `{st.session_state.model_path}`")
1305
- else:
1306
- st.warning("⚠️ No trained model or uploaded results found. You can upload evaluation data below or train a model first.")
1307
-
1308
- # Upload evaluation data
1309
- eval_upload = st.file_uploader("Upload evaluation data (JSONL with instruction + model output)",
1310
- type=['jsonl', 'json'], key="eval_data_upload",
1311
- help="Upload a JSONL file containing instruction-response pairs to evaluate")
1312
  if eval_upload:
1313
  try:
1314
- eval_df = pd.read_json(eval_upload, lines=eval_upload.name.endswith('.jsonl'))
1315
- st.success(f"✅ Loaded **{len(eval_df):,}** samples for evaluation")
1316
- st.dataframe(eval_df.head(5), use_container_width=True)
1317
- st.session_state['eval_data'] = eval_df
1318
  except Exception as e:
1319
- st.error(f"Error loading evaluation data: {e}")
1320
 
1321
  st.markdown("---")
1322
 
1323
- # ── Demo Charts ──
1324
- st.markdown("### πŸ“ˆ Evaluation Results")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1325
 
1326
- col1, col2 = st.columns(2)
1327
- with col1:
1328
- fig = go.Figure(data=[go.Pie(
1329
- values=[72, 18, 10],
1330
- labels=['Fine-tuned Wins', 'Base Model Wins', 'Ties'],
1331
- hole=0.6,
1332
- marker_colors=['#6366f1', '#ef4444', '#94a3b8']
1333
- )])
1334
- fig.update_layout(
1335
- title="Win Rate Distribution",
1336
- paper_bgcolor='rgba(0,0,0,0)',
1337
- plot_bgcolor='rgba(0,0,0,0)',
1338
- font_color='#e2e8f0',
1339
- showlegend=True
1340
- )
1341
- st.plotly_chart(fig, use_container_width=True)
1342
 
1343
- with col2:
1344
- fig = go.Figure(data=[
1345
- go.Bar(name='Base Model', x=['Accuracy', 'Helpfulness', 'Clarity', 'Relevance'], y=[6.2, 5.8, 6.5, 6.0], marker_color='#ef4444'),
1346
- go.Bar(name='Fine-tuned', x=['Accuracy', 'Helpfulness', 'Clarity', 'Relevance'], y=[7.8, 8.1, 7.5, 8.2], marker_color='#6366f1')
1347
- ])
1348
- fig.update_layout(
1349
- title="Score Comparison by Category",
1350
- barmode='group',
1351
- paper_bgcolor='rgba(0,0,0,0)',
1352
- plot_bgcolor='rgba(0,0,0,0)',
1353
- font_color='#e2e8f0',
1354
- yaxis_title="Score (1-10)"
1355
- )
1356
- st.plotly_chart(fig, use_container_width=True)
1357
 
1358
- # Summary metrics
1359
- col1, col2, col3, col4 = st.columns(4)
1360
- with col1:
1361
- st.metric("Win Rate", "72%", "+22%")
1362
- with col2:
1363
- st.metric("Base Avg Score", "6.4/10")
1364
- with col3:
1365
- st.metric("Fine-tuned Avg", "7.8/10", "+1.4")
1366
- with col4:
1367
- st.metric("Comparisons", "50")
1368
 
1369
- st.markdown("---")
1370
 
1371
- # Run evaluation
1372
- col1, col2, col3 = st.columns([1, 2, 1])
1373
- with col2:
1374
- if st.button("🏃 Run Full Evaluation", type="primary", use_container_width=True):
1375
- has_key = any([
1376
- os.environ.get("OPENAI_API_KEY"),
1377
- os.environ.get("ANTHROPIC_API_KEY"),
1378
- os.environ.get("GOOGLE_API_KEY"),
1379
- os.environ.get("GROQ_API_KEY"),
1380
- ])
1381
- if not has_key:
1382
- st.error("❌ Please provide an API key for your selected judge provider.")
1383
- elif not st.session_state.model_path and not st.session_state.get('eval_data') is not None:
1384
- st.error("❌ Please either train a model, upload fine-tuned results, or upload evaluation data.")
1385
- else:
1386
- st.info(f"🏃 Starting evaluation with **{judge_model}** as judge...")
1387
- st.warning("⏳ Full evaluation pipeline integration coming soon. Demo results shown above.")
1388
 
1389
 
1390
  # ============================================================================
 
1213
  def render_evaluation():
1214
  st.markdown('<p class="gradient-header">⚖️ Model Evaluation</p>', unsafe_allow_html=True)
1215
 
1216
+ # Initialize session state for results if not present
1217
+ if 'eval_results' not in st.session_state:
1218
+ st.session_state.eval_results = None
1219
+
1220
  # ── Judge Provider Selection ──
1221
  st.markdown("### πŸ€– Select AI Judge Provider")
1222
+ st.caption("Choose which LLM provider to use as the evaluation judge.")
1223
 
1224
  judge_provider = st.selectbox("AI Provider", [
1225
  "OpenAI (GPT-4o, GPT-4-turbo, etc.)",
1226
  "Anthropic (Claude 3.5, Claude 3 Opus, etc.)",
1227
+ "Groq (Llama 3, Mixtral, Gemma, etc.)",
1228
  "Custom OpenAI-Compatible Endpoint"
1229
+ ], help="Select the AI provider whose model will act as the judge.")
1230
 
1231
  st.markdown("---")
1232
  st.markdown("### 🔑 API Configuration")
1233
 
1234
+ api_key = None
1235
+ base_url = None
1236
+
1237
  if "OpenAI" in judge_provider:
1238
  col1, col2 = st.columns(2)
1239
  with col1:
1240
+ api_key = st.text_input("OpenAI API Key", type="password", key="openai_key_input")
1241
+ if api_key: os.environ["OPENAI_API_KEY"] = api_key
1242
  with col2:
1243
+ judge_model = st.selectbox("Judge Model", ["gpt-4o", "gpt-4-turbo", "gpt-3.5-turbo"])
1244
 
1245
  elif "Anthropic" in judge_provider:
1246
  col1, col2 = st.columns(2)
1247
  with col1:
1248
+ api_key = st.text_input("Anthropic API Key", type="password", key="anthropic_key_input")
1249
+ if api_key: os.environ["ANTHROPIC_API_KEY"] = api_key
1250
  with col2:
1251
+ judge_model = st.selectbox("Judge Model", ["claude-3-5-sonnet-20241022", "claude-3-opus-20240229", "claude-3-sonnet-20240229"])
1252
 
1253
  elif "Groq" in judge_provider:
1254
  col1, col2 = st.columns(2)
1255
  with col1:
1256
+ api_key = st.text_input("Groq API Key", type="password", key="groq_key_input")
1257
+ if api_key: os.environ["GROQ_API_KEY"] = api_key
1258
  with col2:
1259
+ judge_model = st.selectbox("Judge Model", ["llama3-70b-8192", "llama3-8b-8192", "mixtral-8x7b-32768", "gemma-7b-it"])
1260
+ base_url = "https://api.groq.com/openai/v1"
1261
 
1262
+ else: # Custom
1263
  col1, col2 = st.columns(2)
1264
  with col1:
1265
+ base_url = st.text_input("API Base URL", placeholder="https://api.your-provider.com/v1")
1266
+ api_key = st.text_input("API Key", type="password", key="custom_key_input")
1267
+ if api_key: os.environ["OPENAI_API_KEY"] = api_key
1268
  with col2:
1269
+ judge_model = st.text_input("Model Name", placeholder="e.g., my-model")
1270
 
1271
  st.markdown("---")
1272
 
1273
+ # ── Evaluation Data ──
1274
  st.markdown("### πŸ“Š Evaluation Data")
1275
+
1276
+ # 1. Use data from training (if available)
1277
+ if st.session_state.model_path and "finetuned_outputs" in str(st.session_state.model_path):
1278
+ st.info(f"Using results from training: `{st.session_state.model_path}`")
1279
+ try:
1280
+ st.session_state['eval_data'] = pd.read_json(st.session_state.model_path, lines=True)
1281
+ except Exception:
1282
+ pass
1283
 
1284
+ # 2. Upload new data
1285
+ eval_upload = st.file_uploader("Upload JSONL (Must contain: 'instruction', 'base_output', 'finetuned_output')",
1286
+ type=['jsonl', 'json'], key="eval_uploader")
1287
+
1288
  if eval_upload:
1289
  try:
1290
+ df = pd.read_json(eval_upload, lines=eval_upload.name.endswith('.jsonl'))
1291
+ required_cols = ['instruction', 'base_output', 'finetuned_output']
1292
+ if all(col in df.columns for col in required_cols):
1293
+ st.session_state['eval_data'] = df
1294
+ st.success(f"✅ Loaded {len(df)} samples")
1295
+ else:
1296
+ st.error(f"❌ Missing columns! Found: {list(df.columns)}. Required: {required_cols}")
1297
  except Exception as e:
1298
+ st.error(f"Error loading file: {e}")
1299
+
1300
+ # Show Preview
1301
+ if st.session_state.get('eval_data') is not None:
1302
+ with st.expander("👁️ View Data Preview"):
1303
+ st.dataframe(st.session_state['eval_data'].head(3), use_container_width=True)
1304
 
1305
  st.markdown("---")
1306
 
1307
+ # ── Run Evaluation ──
1308
+ if st.button("πŸš€ Run Dynamic Evaluation", type="primary", use_container_width=True):
1309
+ if not api_key:
1310
+ st.error("❌ Please provide an API Key above!")
1311
+ return
1312
+
1313
+ if st.session_state.get('eval_data') is None:
1314
+ st.error("❌ No evaluation data loaded!")
1315
+ return
1316
+
1317
+ # Prepare Judge
1318
+ st.session_state.pipeline_status['evaluation'] = 'running'
1319
+ progress_bar = st.progress(0)
1320
+ status_text = st.empty()
1321
+
1322
+ results = []
1323
+ df = st.session_state['eval_data']
1324
+ total = len(df)
1325
+
1326
+ try:
1327
+ # Initialize Client
1328
+ client = None
1329
+ if "Anthropic" in judge_provider:
1330
+ from anthropic import Anthropic
1331
+ client = Anthropic(api_key=api_key)
1332
+ else:
1333
+ from openai import OpenAI
1334
+ client = OpenAI(api_key=api_key, base_url=base_url)
1335
+
1336
+ JUDGE_PROMPT = """You are an expert evaluator comparing two AI responses.
1337
+
1338
+ Query: {prompt}
1339
+
1340
+ Response A (Base Model):
1341
+ {response_a}
1342
+
1343
+ Response B (Fine-tuned Model):
1344
+ {response_b}
1345
+
1346
+ Compare them on: Accuracy, Helpfulness, Clarity.
1347
+ Return a valid JSON object ONLY:
1348
+ {{
1349
+ "winner": "A" or "B" or "TIE",
1350
+ "score_a": <1-10>,
1351
+ "score_b": <1-10>,
1352
+ "reasoning": "short explanation",
1353
+ "accuracy": {{"A": <1-10>, "B": <1-10>}},
1354
+ "helpfulness": {{"A": <1-10>, "B": <1-10>}},
1355
+ "clarity": {{"A": <1-10>, "B": <1-10>}}
1356
+ }}
1357
+ """
1358
 
1359
+ for i, row in df.iterrows():
1360
+ status_text.text(f"Evaluating sample {i+1}/{total}...")
1361
+
1362
+ prompt_text = JUDGE_PROMPT.format(
1363
+ prompt=row['instruction'],
1364
+ response_a=row['base_output'],
1365
+ response_b=row['finetuned_output']
1366
+ )
1367
 
1368
+ # Call API
1369
+ if "Anthropic" in judge_provider:
1370
+ resp = client.messages.create(
1371
+ model=judge_model, max_tokens=1000,
1372
+ messages=[{"role": "user", "content": prompt_text}]
1373
+ ).content[0].text
1374
+ else:
1375
+ resp = client.chat.completions.create(
1376
+ model=judge_model, max_tokens=1000,
1377
+ messages=[{"role": "user", "content": prompt_text}],
1378
+ response_format={"type": "json_object"}
1379
+ ).choices[0].message.content
1380
 
1381
+ # Parse
1382
+ try:
1383
+ import json
1384
+ # Clean json string if needed
1385
+ if "```json" in resp: resp = resp.split("```json")[1].split("```")[0]
1386
+ if "```" in resp: resp = resp.split("```")[1]
1387
+
1388
+ data = json.loads(resp.strip())
1389
+ data['instruction'] = row['instruction']
1390
+ results.append(data)
1391
+ except Exception as e:
1392
+ print(f"Parse error: {e}")
1393
+ results.append({"instruction": row['instruction'], "winner": "TIE", "score_a": 5, "score_b": 5, "reasoning": "Error parsing judge response"})  # keep 'instruction' so the verdict table can always select it
1394
 
1395
+ progress_bar.progress((i + 1) / total)
1396
 
1397
+ st.session_state.eval_results = results
1398
+ st.session_state.pipeline_status['evaluation'] = 'complete'
1399
+ status_text.text("✅ Evaluation Complete!")
1400
+
1401
+ except Exception as e:
1402
+ st.error(f"Evaluation Failed: {str(e)}")
1403
+ st.session_state.pipeline_status['evaluation'] = 'error'
1404
+
1405
+ # ── Display Results ──
1406
+ if st.session_state.get('eval_results'):
1407
+ results = st.session_state.eval_results
1408
+ df_res = pd.DataFrame(results)
1409
+
1410
+ # Metrics
1411
+ wins_b = len(df_res[df_res['winner'] == 'B'])
1412
+ wins_a = len(df_res[df_res['winner'] == 'A'])
1413
+ ties = len(df_res[df_res['winner'] == 'TIE'])
1414
+ win_rate = (wins_b / len(df_res)) * 100
1415
+
1416
+ col1, col2, col3, col4 = st.columns(4)
1417
+ col1.metric("Fine-tuned Win Rate", f"{win_rate:.1f}%")
1418
+ col2.metric("Fine-Tuned Wins", wins_b)
1419
+ col3.metric("Base Model Wins", wins_a)
1420
+ col4.metric("Avg Score Improvement", f"{df_res['score_b'].mean() - df_res['score_a'].mean():.2f}")
1421
+
1422
+ # Charts
1423
+ c1, c2 = st.columns(2)
1424
+ with c1:
1425
+ fig = px.pie(values=[wins_b, wins_a, ties], names=['Fine-tuned', 'Base', 'Ties'],
1426
+ title="Win Distribution", color_discrete_sequence=['#6366f1', '#ef4444', '#94a3b8'])
1427
+ st.plotly_chart(fig, use_container_width=True)
1428
+
1429
+ with c2:
1430
+ avg_scores = pd.DataFrame({
1431
+ 'Model': ['Base', 'Fine-tuned'],
1432
+ 'Score': [df_res['score_a'].mean(), df_res['score_b'].mean()]
1433
+ })
1434
+ fig2 = px.bar(avg_scores, x='Model', y='Score', color='Model',
1435
+ title="Average Overall Score", color_discrete_map={'Base': '#ef4444', 'Fine-tuned': '#6366f1'})
1436
+ st.plotly_chart(fig2, use_container_width=True)
1437
+
1438
+ # Detailed Table
1439
+ st.markdown("### 📝 Detailed Verdicts")
1440
+ st.dataframe(df_res[['instruction', 'winner', 'score_a', 'score_b', 'reasoning']], use_container_width=True)
1441
+
1442
+ # Download
1443
+ st.download_button("⬇️ Download Report (JSON)",
1444
+ data=df_res.to_json(orient="records", indent=2),  # avoids `json`, which is only bound after the evaluation loop has run in this call
1445
+ file_name="evaluation_report.json",
1446
+ mime="application/json")
1447
 
1448
 
1449
  # ============================================================================
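The parsing and aggregation steps added in this commit can also be isolated from the Streamlit code for testing. The sketch below is one possible way to do that, assuming the judge replies with the JSON shape requested in `JUDGE_PROMPT`. The names `parse_judge_reply`, `summarize`, and `FALLBACK` are hypothetical helpers and are not part of `app.py`.

```python
import json
from collections import Counter

# Neutral verdict used when the judge reply cannot be parsed (mirrors the diff).
FALLBACK = {"winner": "TIE", "score_a": 5, "score_b": 5,
            "reasoning": "Error parsing judge response"}


def parse_judge_reply(raw: str, instruction: str) -> dict:
    """Extract the JSON verdict from a judge reply, tolerating Markdown fences."""
    text = raw.strip()
    if "```" in text:
        # Keep only the body of the first fenced block, dropping a `json` tag.
        text = text.split("```")[1]
        if text.lstrip().lower().startswith("json"):
            text = text.lstrip()[4:]
    try:
        verdict = json.loads(text.strip())
        if not isinstance(verdict, dict):
            raise ValueError("judge reply is not a JSON object")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        verdict = dict(FALLBACK)
    verdict["instruction"] = instruction
    return verdict


def summarize(results: list) -> dict:
    """Count verdicts and compute the fine-tuned (B) win rate in percent."""
    counts = Counter(r.get("winner", "TIE") for r in results)
    total = len(results)
    win_rate = 100.0 * counts["B"] / total if total else 0.0
    return {"wins_a": counts["A"], "wins_b": counts["B"],
            "ties": counts["TIE"], "win_rate": win_rate}
```

Attaching `instruction` to fallback rows as well keeps every record selectable by the same columns, and importing `json` at module level keeps it available on every Streamlit rerun.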