feat: implement dynamic evaluation page (remove static charts)
- PROJECT_HIGHLIGHTS.md +0 -54
- app.py +186 -127
PROJECT_HIGHLIGHTS.md
DELETED
|
@@ -1,54 +0,0 @@
|
|
| 1 |
-
# Auto-FineTune-Ops: Project Highlights
|
| 2 |
-
|
| 3 |
-
**Autonomous Machine Learning Pipeline for Production-Grade LLM Fine-Tuning**
|
| 4 |
-
|
| 5 |
-
Auto-FineTune-Ops is a comprehensive, no-code/low-code platform that democratizes access to state-of-the-art LLM fine-tuning. It automates the complex lifecycle of data preparation, training, evaluation, and deployment.
|
| 6 |
-
|
| 7 |
-
---
|
| 8 |
-
|
| 9 |
-
## Key Features
|
| 10 |
-
|
| 11 |
-
### 1. Intelligent Preprocessing Engine
|
| 12 |
-
A modular, production-ready data pipeline with 10+ specialized modules:
|
| 13 |
-
- **Text Cleaning:** Auto-strip HTML, emojis, URLs, and normalize whitespace.
|
| 14 |
-
- **PII Redaction:** Detect and mask emails, phone numbers, and keys for security.
|
| 15 |
-
- **Deduplication:** Remove exact and semantic duplicates (using TF-IDF/Cosine Similarity).
|
| 16 |
-
- **Quality Filtering:** Filter by language, toxicity, and length constraints.
|
| 17 |
-
- **Advanced Formatting:** Auto-convert loose CSV/JSON into strict Chat Templates (ShareGPT/OpenAI).
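The PII redaction step listed above can be illustrated with a small regex pass. This is a minimal sketch, assuming simple email and phone patterns; the function name `redact_pii`, the placeholder tokens, and the patterns themselves are illustrative assumptions, not the project's actual module.

```python
import re

# Deliberately simple patterns (assumptions for illustration only).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    """Mask emails first, then phone-like digit runs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

print(redact_pii("Mail jane@example.com or call +1 (555) 123-4567"))
```

Real-world redaction usually needs locale-aware phone parsing and key detection on top of this; the sketch only shows the masking pattern.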
|
| 18 |
-
|
| 19 |
-
### 2. ⚡ Hybrid Training Ecosystem
|
| 20 |
-
Flexible training workflows designed for all hardware setups:
|
| 21 |
-
- **Local GPU Power:** Leverages **Unsloth** for 2x faster training and 70% less memory usage (4-bit quantization).
|
| 22 |
-
- **Google Colab Bridge:** Seamless "No-GPU" fallback flow. Generate a ready-to-run Colab notebook to train on free cloud GPUs if local hardware is insufficient.
|
| 23 |
-
- **Custom Model Support:** Fine-tune any HuggingFace model (Llama 3, Mistral, Gemma, Phi-3, etc.).
|
| 24 |
-
|
| 25 |
-
### 3. ⚖️ Multi-Provider AI Judge Arena
|
| 26 |
-
Production-grade model evaluation using LLM-as-a-Judge:
|
| 27 |
-
- **Provider Agnostic:** Supports OpenAI (GPT-4o), Anthropic (Claude 3.5), Google (Gemini 1.5), and Groq (Llama 3).
|
| 28 |
-
- **Custom Endpoints:** Connect to local LLMs (Ollama/vLLM) as judges.
|
| 29 |
-
- **Comprehensive Metrics:** Automated scoring for Accuracy, Helpfulness, Clarity, and Tone.
|
| 30 |
-
- **Head-to-Head:** Win-rate visualization comparing Base Model vs. Fine-Tuned Model.
|
| 31 |
-
|
| 32 |
-
### 4. 🖥️ Interactive Streamlit Dashboard
|
| 33 |
-
A premium, dark-mode UI that abstracts away CLI complexity:
|
| 34 |
-
- **Project Management:** Manage datasets, models, and logs visually.
|
| 35 |
-
- **Real-time Monitoring:** Track training loss and progress live.
|
| 36 |
-
- **Visualization:** Interactive Plotly charts for evaluation results.
|
| 37 |
-
|
| 38 |
-
### 5. One-Click Deployment
|
| 39 |
-
- **Instant API:** Export trained models as a production-ready **FastAPI** microservice.
|
| 40 |
-
- **Standardized Interface:** OpenAI-compatible `/generate` endpoints for easy integration into apps.
|
| 41 |
-
|
| 42 |
-
---
|
| 43 |
-
|
| 44 |
-
## Technical Stack
|
| 45 |
-
- **Frontend:** Streamlit, Plotly
|
| 46 |
-
- **Core ML:** PyTorch, Transformers, PEFT, Unsloth, TRL
|
| 47 |
-
- **Data:** Pandas, NumPy, Scikit-learn
|
| 48 |
-
- **API:** FastAPI, Uvicorn
|
| 49 |
-
- **LLM Clients:** OpenAI SDK, Anthropic SDK
|
| 50 |
-
|
| 51 |
-
## 🛡️ Production Readiness
|
| 52 |
-
- **Modular Architecture:** Agent-based design (DataArchitect, TrainingPilot, TheJudge) allows easy extensibility.
|
| 53 |
-
- **Error Handling:** Robust fallback mechanisms and detailed logging.
|
| 54 |
-
- **Security:** PII masking and API key management best practices.
|
app.py
CHANGED
|
@@ -1213,178 +1213,237 @@ def render_training():
|
|
| 1213 |
def render_evaluation():
|
| 1214 |
st.markdown('<p class="gradient-header">⚖️ Model Evaluation</p>', unsafe_allow_html=True)
|
| 1215 |
|
| 1216 |
# ── Judge Provider Selection ──
|
| 1217 |
st.markdown("### 🤖 Select AI Judge Provider")
|
| 1218 |
-
st.caption("Choose which LLM provider to use as the evaluation judge.
|
| 1219 |
|
| 1220 |
judge_provider = st.selectbox("AI Provider", [
|
| 1221 |
"OpenAI (GPT-4o, GPT-4-turbo, etc.)",
|
| 1222 |
"Anthropic (Claude 3.5, Claude 3 Opus, etc.)",
|
| 1223 |
-
"
|
| 1224 |
-
"Groq (Llama, Mixtral, Gemma, etc.)",
|
| 1225 |
"Custom OpenAI-Compatible Endpoint"
|
| 1226 |
-
], help="Select the AI provider whose model will act as the judge
|
| 1227 |
|
| 1228 |
st.markdown("---")
|
| 1229 |
st.markdown("### API Configuration")
|
| 1230 |
|
| 1231 |
if "OpenAI" in judge_provider:
|
| 1232 |
col1, col2 = st.columns(2)
|
| 1233 |
with col1:
|
| 1234 |
-
|
| 1235 |
-
|
| 1236 |
-
if openai_key:
|
| 1237 |
-
os.environ["OPENAI_API_KEY"] = openai_key
|
| 1238 |
with col2:
|
| 1239 |
-
judge_model = st.selectbox("Judge Model", [
|
| 1240 |
-
"gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-4", "gpt-3.5-turbo"
|
| 1241 |
-
])
|
| 1242 |
|
| 1243 |
elif "Anthropic" in judge_provider:
|
| 1244 |
col1, col2 = st.columns(2)
|
| 1245 |
with col1:
|
| 1246 |
-
|
| 1247 |
-
|
| 1248 |
-
if anthropic_key:
|
| 1249 |
-
os.environ["ANTHROPIC_API_KEY"] = anthropic_key
|
| 1250 |
-
with col2:
|
| 1251 |
-
judge_model = st.selectbox("Judge Model", [
|
| 1252 |
-
"claude-3-5-sonnet-20241022", "claude-3-opus-20240229",
|
| 1253 |
-
"claude-3-sonnet-20240229", "claude-3-haiku-20240307"
|
| 1254 |
-
])
|
| 1255 |
-
|
| 1256 |
-
elif "Gemini" in judge_provider:
|
| 1257 |
-
col1, col2 = st.columns(2)
|
| 1258 |
-
with col1:
|
| 1259 |
-
gemini_key = st.text_input("Google AI API Key", type="password",
|
| 1260 |
-
help="Your Google AI Studio API key for Gemini models")
|
| 1261 |
-
if gemini_key:
|
| 1262 |
-
os.environ["GOOGLE_API_KEY"] = gemini_key
|
| 1263 |
with col2:
|
| 1264 |
-
judge_model = st.selectbox("Judge Model", [
|
| 1265 |
-
"gemini-1.5-pro", "gemini-1.5-flash", "gemini-pro"
|
| 1266 |
-
])
|
| 1267 |
|
| 1268 |
elif "Groq" in judge_provider:
|
| 1269 |
col1, col2 = st.columns(2)
|
| 1270 |
with col1:
|
| 1271 |
-
|
| 1272 |
-
|
| 1273 |
-
if groq_key:
|
| 1274 |
-
os.environ["GROQ_API_KEY"] = groq_key
|
| 1275 |
with col2:
|
| 1276 |
-
judge_model = st.selectbox("Judge Model", [
|
| 1277 |
-
|
| 1278 |
-
"mixtral-8x7b-32768", "gemma2-9b-it"
|
| 1279 |
-
])
|
| 1280 |
|
| 1281 |
-
else: # Custom
|
| 1282 |
col1, col2 = st.columns(2)
|
| 1283 |
with col1:
|
| 1284 |
-
|
| 1285 |
-
|
| 1286 |
-
|
| 1287 |
-
custom_api_key = st.text_input("API Key", type="password",
|
| 1288 |
-
help="API key for the custom endpoint (use 'none' for local servers)")
|
| 1289 |
-
if custom_api_key:
|
| 1290 |
-
os.environ["OPENAI_API_KEY"] = custom_api_key
|
| 1291 |
-
if custom_base_url:
|
| 1292 |
-
os.environ["OPENAI_BASE_URL"] = custom_base_url
|
| 1293 |
with col2:
|
| 1294 |
-
judge_model = st.text_input("Model Name",
|
| 1295 |
-
placeholder="e.g., my-model, llama-3-8b",
|
| 1296 |
-
help="Model identifier used by your custom endpoint")
|
| 1297 |
|
| 1298 |
st.markdown("---")
|
| 1299 |
|
| 1300 |
-
# ββ
|
| 1301 |
st.markdown("### Evaluation Data")
|
| 1302 |
|
| 1303 |
-
|
| 1304 |
-
|
| 1305 |
-
|
| 1306 |
-
|
| 1307 |
-
|
| 1308 |
-
# Upload evaluation data
|
| 1309 |
-
eval_upload = st.file_uploader("Upload evaluation data (JSONL with instruction + model output)",
|
| 1310 |
-
type=['jsonl', 'json'], key="eval_data_upload",
|
| 1311 |
-
help="Upload a JSONL file containing instruction-response pairs to evaluate")
|
| 1312 |
if eval_upload:
|
| 1313 |
try:
|
| 1314 |
-
|
| 1315 |
-
|
| 1316 |
-
|
| 1317 |
-
|
| 1318 |
except Exception as e:
|
| 1319 |
-
st.error(f"Error loading
|
| 1320 |
|
| 1321 |
st.markdown("---")
|
| 1322 |
|
| 1323 |
-
# ββ
|
| 1324 |
-
st.
|
| 1325 |
|
| 1326 |
-
|
| 1327 |
-
|
| 1328 |
-
|
| 1329 |
-
|
| 1330 |
-
|
| 1331 |
-
|
| 1332 |
-
|
| 1333 |
-
|
| 1334 |
-
fig.update_layout(
|
| 1335 |
-
title="Win Rate Distribution",
|
| 1336 |
-
paper_bgcolor='rgba(0,0,0,0)',
|
| 1337 |
-
plot_bgcolor='rgba(0,0,0,0)',
|
| 1338 |
-
font_color='#e2e8f0',
|
| 1339 |
-
showlegend=True
|
| 1340 |
-
)
|
| 1341 |
-
st.plotly_chart(fig, use_container_width=True)
|
| 1342 |
|
| 1343 |
-
|
| 1344 |
-
|
| 1345 |
-
|
| 1346 |
-
|
| 1347 |
-
|
| 1348 |
-
|
| 1349 |
-
|
| 1350 |
-
|
| 1351 |
-
|
| 1352 |
-
|
| 1353 |
-
|
| 1354 |
-
|
| 1355 |
-
)
|
| 1356 |
-
st.plotly_chart(fig, use_container_width=True)
|
| 1357 |
|
| 1358 |
-
|
| 1359 |
-
|
| 1360 |
-
|
| 1361 |
-
|
| 1362 |
-
|
| 1363 |
-
|
| 1364 |
-
|
| 1365 |
-
|
| 1366 |
-
|
| 1367 |
-
|
| 1368 |
|
| 1369 |
-
|
| 1370 |
|
| 1371 |
-
|
| 1372 |
-
|
| 1373 |
-
|
| 1374 |
-
|
| 1375 |
-
|
| 1376 |
-
|
| 1377 |
-
|
| 1378 |
-
|
| 1379 |
-
|
| 1380 |
-
|
| 1381 |
-
|
| 1382 |
-
|
| 1383 |
-
|
| 1384 |
-
|
| 1385 |
-
|
| 1386 |
-
|
| 1387 |
-
|
| 1388 |
|
| 1389 |
|
| 1390 |
# ============================================================================
|
| 1213 |
def render_evaluation():
|
| 1214 |
st.markdown('<p class="gradient-header">⚖️ Model Evaluation</p>', unsafe_allow_html=True)
|
| 1215 |
|
| 1216 |
+
# Initialize session state for results if not present
|
| 1217 |
+
if 'eval_results' not in st.session_state:
|
| 1218 |
+
st.session_state.eval_results = None
|
| 1219 |
+
|
| 1220 |
# ── Judge Provider Selection ──
|
| 1221 |
st.markdown("### 🤖 Select AI Judge Provider")
|
| 1222 |
+
st.caption("Choose which LLM provider to use as the evaluation judge.")
|
| 1223 |
|
| 1224 |
judge_provider = st.selectbox("AI Provider", [
|
| 1225 |
"OpenAI (GPT-4o, GPT-4-turbo, etc.)",
|
| 1226 |
"Anthropic (Claude 3.5, Claude 3 Opus, etc.)",
|
| 1227 |
+
"Groq (Llama 3, Mixtral, Gemma, etc.)",
|
| 1228 |
"Custom OpenAI-Compatible Endpoint"
|
| 1229 |
+
], help="Select the AI provider whose model will act as the judge.")
|
| 1230 |
|
| 1231 |
st.markdown("---")
|
| 1232 |
st.markdown("### API Configuration")
|
| 1233 |
|
| 1234 |
+
api_key = None
|
| 1235 |
+
base_url = None
|
| 1236 |
+
|
| 1237 |
if "OpenAI" in judge_provider:
|
| 1238 |
col1, col2 = st.columns(2)
|
| 1239 |
with col1:
|
| 1240 |
+
api_key = st.text_input("OpenAI API Key", type="password", key="openai_key_input")
|
| 1241 |
+
if api_key: os.environ["OPENAI_API_KEY"] = api_key
|
| 1242 |
with col2:
|
| 1243 |
+
judge_model = st.selectbox("Judge Model", ["gpt-4o", "gpt-4-turbo", "gpt-3.5-turbo"])
|
| 1244 |
|
| 1245 |
elif "Anthropic" in judge_provider:
|
| 1246 |
col1, col2 = st.columns(2)
|
| 1247 |
with col1:
|
| 1248 |
+
api_key = st.text_input("Anthropic API Key", type="password", key="anthropic_key_input")
|
| 1249 |
+
if api_key: os.environ["ANTHROPIC_API_KEY"] = api_key
|
| 1250 |
with col2:
|
| 1251 |
+
judge_model = st.selectbox("Judge Model", ["claude-3-5-sonnet-20241022", "claude-3-opus-20240229", "claude-3-sonnet-20240229"])
|
| 1252 |
|
| 1253 |
elif "Groq" in judge_provider:
|
| 1254 |
col1, col2 = st.columns(2)
|
| 1255 |
with col1:
|
| 1256 |
+
api_key = st.text_input("Groq API Key", type="password", key="groq_key_input")
|
| 1257 |
+
if api_key: os.environ["GROQ_API_KEY"] = api_key
|
| 1258 |
with col2:
|
| 1259 |
+
judge_model = st.selectbox("Judge Model", ["llama3-70b-8192", "llama3-8b-8192", "mixtral-8x7b-32768", "gemma-7b-it"])
|
| 1260 |
+
base_url = "https://api.groq.com/openai/v1"
|
| 1261 |
|
| 1262 |
+
else: # Custom
|
| 1263 |
col1, col2 = st.columns(2)
|
| 1264 |
with col1:
|
| 1265 |
+
base_url = st.text_input("API Base URL", placeholder="https://api.your-provider.com/v1")
|
| 1266 |
+
api_key = st.text_input("API Key", type="password", key="custom_key_input")
|
| 1267 |
+
if api_key: os.environ["OPENAI_API_KEY"] = api_key
|
| 1268 |
with col2:
|
| 1269 |
+
judge_model = st.text_input("Model Name", placeholder="e.g., my-model")
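The provider branches above all do the same three things: pick an environment variable, pick a base URL, and pick a model. A table-driven sketch of that lookup follows. The names `PROVIDERS` and `resolve_provider` are hypothetical; the environment variables and the Groq base URL are the ones shown in the diff. A substring test such as `"OpenAI" in label` would also match the label "Custom OpenAI-Compatible Endpoint", so this sketch matches on the label prefix instead.

```python
# Hypothetical lookup table; keys are the leading word of each selectbox label.
PROVIDERS = {
    "OpenAI": {"env_var": "OPENAI_API_KEY", "base_url": None},
    "Anthropic": {"env_var": "ANTHROPIC_API_KEY", "base_url": None},
    "Groq": {"env_var": "GROQ_API_KEY",
             "base_url": "https://api.groq.com/openai/v1"},
}

def resolve_provider(label, custom_base_url=None):
    """Return the env var and base URL for a provider label."""
    for name, config in PROVIDERS.items():
        if label.startswith(name):  # prefix match avoids the "Custom OpenAI-..." clash
            return dict(config)
    # Anything else is treated as a custom OpenAI-compatible endpoint.
    return {"env_var": "OPENAI_API_KEY", "base_url": custom_base_url}
```

For example, `resolve_provider("Custom OpenAI-Compatible Endpoint", "http://localhost:8000/v1")` returns the user-supplied URL rather than the OpenAI defaults.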
|
| 1270 |
|
| 1271 |
st.markdown("---")
|
| 1272 |
|
| 1273 |
+
# ── Evaluation Data ──
|
| 1274 |
st.markdown("### Evaluation Data")
|
| 1275 |
+
|
| 1276 |
+
# 1. Use data from training (if available)
|
| 1277 |
+
if st.session_state.model_path and "finetuned_outputs" in str(st.session_state.model_path):
|
| 1278 |
+
st.info(f"Using results from training: `{st.session_state.model_path}`")
|
| 1279 |
+
try:
|
| 1280 |
+
st.session_state['eval_data'] = pd.read_json(st.session_state.model_path, lines=True)
|
| 1281 |
+
except Exception:
|
| 1282 |
+
pass
|
| 1283 |
|
| 1284 |
+
# 2. Upload new data
|
| 1285 |
+
eval_upload = st.file_uploader("Upload JSONL (Must contain: 'instruction', 'base_output', 'finetuned_output')",
|
| 1286 |
+
type=['jsonl', 'json'], key="eval_uploader")
|
| 1287 |
+
|
| 1288 |
if eval_upload:
|
| 1289 |
try:
|
| 1290 |
+
df = pd.read_json(eval_upload, lines=eval_upload.name.endswith('.jsonl'))
|
| 1291 |
+
required_cols = ['instruction', 'base_output', 'finetuned_output']
|
| 1292 |
+
if all(col in df.columns for col in required_cols):
|
| 1293 |
+
st.session_state['eval_data'] = df
|
| 1294 |
+
st.success(f"✅ Loaded {len(df)} samples")
|
| 1295 |
+
else:
|
| 1296 |
+
st.error(f"❌ Missing columns! Found: {list(df.columns)}. Required: {required_cols}")
|
| 1297 |
except Exception as e:
|
| 1298 |
+
st.error(f"Error loading file: {e}")
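The upload branch reads the file with pandas and checks that three columns exist. The same contract can be sketched with the standard library alone, assuming one JSON object per line. `load_eval_records` is a hypothetical helper, and unlike the column check it validates every record, so a single incomplete row is reported with its line number.

```python
import json

REQUIRED_KEYS = ("instruction", "base_output", "finetuned_output")

def load_eval_records(text):
    """Parse JSONL text and verify each record has the required keys."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        records.append(record)
    return records
```

In a Streamlit handler, the text would come from something like `eval_upload.getvalue().decode("utf-8")`, and the `ValueError` message could be passed to `st.error`.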
|
| 1299 |
+
|
| 1300 |
+
# Show Preview
|
| 1301 |
+
if st.session_state.get('eval_data') is not None:
|
| 1302 |
+
with st.expander("View Data Preview"):
|
| 1303 |
+
st.dataframe(st.session_state['eval_data'].head(3), use_container_width=True)
|
| 1304 |
|
| 1305 |
st.markdown("---")
|
| 1306 |
|
| 1307 |
+
# ── Run Evaluation ──
|
| 1308 |
+
if st.button("Run Dynamic Evaluation", type="primary", use_container_width=True):
|
| 1309 |
+
if not api_key:
|
| 1310 |
+
st.error("❌ Please provide an API Key above!")
|
| 1311 |
+
return
|
| 1312 |
+
|
| 1313 |
+
if st.session_state.get('eval_data') is None:
|
| 1314 |
+
st.error("❌ No evaluation data loaded!")
|
| 1315 |
+
return
|
| 1316 |
+
|
| 1317 |
+
# Prepare Judge
|
| 1318 |
+
st.session_state.pipeline_status['evaluation'] = 'running'
|
| 1319 |
+
progress_bar = st.progress(0)
|
| 1320 |
+
status_text = st.empty()
|
| 1321 |
+
|
| 1322 |
+
results = []
|
| 1323 |
+
df = st.session_state['eval_data']
|
| 1324 |
+
total = len(df)
|
| 1325 |
+
|
| 1326 |
+
try:
|
| 1327 |
+
# Initialize Client
|
| 1328 |
+
client = None
|
| 1329 |
+
if "Anthropic" in judge_provider:
|
| 1330 |
+
from anthropic import Anthropic
|
| 1331 |
+
client = Anthropic(api_key=api_key)
|
| 1332 |
+
else:
|
| 1333 |
+
from openai import OpenAI
|
| 1334 |
+
client = OpenAI(api_key=api_key, base_url=base_url)
|
| 1335 |
+
|
| 1336 |
+
JUDGE_PROMPT = """You are an expert evaluator comparing two AI responses.
|
| 1337 |
+
|
| 1338 |
+
Query: {prompt}
|
| 1339 |
+
|
| 1340 |
+
Response A (Base Model):
|
| 1341 |
+
{response_a}
|
| 1342 |
+
|
| 1343 |
+
Response B (Fine-tuned Model):
|
| 1344 |
+
{response_b}
|
| 1345 |
+
|
| 1346 |
+
Compare them on: Accuracy, Helpfulness, Clarity.
|
| 1347 |
+
Return a valid JSON object ONLY:
|
| 1348 |
+
{{
|
| 1349 |
+
"winner": "A" or "B" or "TIE",
|
| 1350 |
+
"score_a": <1-10>,
|
| 1351 |
+
"score_b": <1-10>,
|
| 1352 |
+
"reasoning": "short explanation",
|
| 1353 |
+
"accuracy": {{"A": <1-10>, "B": <1-10>}},
|
| 1354 |
+
"helpfulness": {{"A": <1-10>, "B": <1-10>}},
|
| 1355 |
+
"clarity": {{"A": <1-10>, "B": <1-10>}}
|
| 1356 |
+
}}
|
| 1357 |
+
"""
|
| 1358 |
|
| 1359 |
+
for i, row in df.iterrows():
|
| 1360 |
+
status_text.text(f"Evaluating sample {i+1}/{total}...")
|
| 1361 |
+
|
| 1362 |
+
prompt_text = JUDGE_PROMPT.format(
|
| 1363 |
+
prompt=row['instruction'],
|
| 1364 |
+
response_a=row['base_output'],
|
| 1365 |
+
response_b=row['finetuned_output']
|
| 1366 |
+
)
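The doubled braces in `JUDGE_PROMPT` are what let `str.format` keep the JSON skeleton literal while filling in the named fields. A reduced sketch of the mechanism follows; the template is shortened for illustration and is not the full prompt, and `build_prompt` is a hypothetical name.

```python
TEMPLATE = """Query: {prompt}
Return a valid JSON object ONLY:
{{"winner": "A" or "B" or "TIE", "score_a": <1-10>, "score_b": <1-10>}}
"""

def build_prompt(instruction):
    # "{{" and "}}" render as literal braces; "{prompt}" is substituted.
    return TEMPLATE.format(prompt=instruction)

print(build_prompt("What is 2 + 2?"))
```

Substituted values are not re-parsed by `format`, so an instruction that itself contains `{` or `}` passes through unchanged.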
|
| 1367 |
|
| 1368 |
+
# Call API
|
| 1369 |
+
if "Anthropic" in judge_provider:
|
| 1370 |
+
resp = client.messages.create(
|
| 1371 |
+
model=judge_model, max_tokens=1000,
|
| 1372 |
+
messages=[{"role": "user", "content": prompt_text}]
|
| 1373 |
+
).content[0].text
|
| 1374 |
+
else:
|
| 1375 |
+
resp = client.chat.completions.create(
|
| 1376 |
+
model=judge_model, max_tokens=1000,
|
| 1377 |
+
messages=[{"role": "user", "content": prompt_text}],
|
| 1378 |
+
response_format={"type": "json_object"}
|
| 1379 |
+
).choices[0].message.content
|
| 1380 |
|
| 1381 |
+
# Parse
|
| 1382 |
+
try:
|
| 1383 |
+
import json
|
| 1384 |
+
# Clean json string if needed
|
| 1385 |
+
if "```json" in resp: resp = resp.split("```json")[1].split("```")[0]
|
| 1386 |
+
if "```" in resp: resp = resp.split("```")[1]
|
| 1387 |
+
|
| 1388 |
+
data = json.loads(resp.strip())
|
| 1389 |
+
data['instruction'] = row['instruction']
|
| 1390 |
+
results.append(data)
|
| 1391 |
+
except Exception as e:
|
| 1392 |
+
print(f"Parse error: {e}")
|
| 1393 |
+
results.append({"winner": "TIE", "score_a": 5, "score_b": 5, "reasoning": "Error parsing judge response"})
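The parsing step strips Markdown fences and falls back to a neutral verdict on failure. One possible hardening is sketched below under two assumptions: the verdict is the outermost `{...}` span in the reply, and the fallback should carry the `instruction` field so every result row has the same keys. `parse_judge_response` is a hypothetical helper, not code from this commit.

```python
import json

def parse_judge_response(text, instruction):
    """Best-effort extraction of a JSON verdict from a judge reply."""
    fallback = {"winner": "TIE", "score_a": 5, "score_b": 5,
                "reasoning": "Error parsing judge response",
                "instruction": instruction}
    # Take the outermost brace span; this skips code fences and stray prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    data["instruction"] = instruction
    return data
```

Two related points may be worth checking in the committed code. A fallback row without `instruction` yields a missing value in the verdict table, and a `KeyError` if every row fails to parse. Also, because `import json` sits inside the button branch, `json` becomes a function-local name, so a later `json.dumps` call in the same function can raise `UnboundLocalError` on a rerun where the button was not pressed; a module-level import avoids that.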
|
| 1394 |
|
| 1395 |
+
progress_bar.progress((i + 1) / total)
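`DataFrame.iterrows()` yields index labels, not positions, so `(i + 1) / total` is only a valid fraction when the index is the default 0..n-1 range. That holds for a freshly loaded file but not necessarily for a filtered or re-indexed frame. A position-based sketch, using a plain list of labels to stay dependency-free (`progress_fractions` is a hypothetical name):

```python
def progress_fractions(index_labels):
    """Yield (label, fraction_done) based on position, not on the label value."""
    total = len(index_labels)
    for position, label in enumerate(index_labels, start=1):
        yield label, position / total

# Labels left over after filtering rows 0-9 out of a larger frame.
for label, fraction in progress_fractions([10, 11, 12, 13]):
    print(label, fraction)
```

In the loop above, the equivalent change would be `for pos, (i, row) in enumerate(df.iterrows(), start=1)` with `pos / total` passed to the progress bar.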
|
| 1396 |
|
| 1397 |
+
st.session_state.eval_results = results
|
| 1398 |
+
st.session_state.pipeline_status['evaluation'] = 'complete'
|
| 1399 |
+
status_text.text("✅ Evaluation Complete!")
|
| 1400 |
+
|
| 1401 |
+
except Exception as e:
|
| 1402 |
+
st.error(f"Evaluation Failed: {str(e)}")
|
| 1403 |
+
st.session_state.pipeline_status['evaluation'] = 'error'
|
| 1404 |
+
|
| 1405 |
+
# ── Display Results ──
|
| 1406 |
+
if st.session_state.get('eval_results'):
|
| 1407 |
+
results = st.session_state.eval_results
|
| 1408 |
+
df_res = pd.DataFrame(results)
|
| 1409 |
+
|
| 1410 |
+
# Metrics
|
| 1411 |
+
wins_b = len(df_res[df_res['winner'] == 'B'])
|
| 1412 |
+
wins_a = len(df_res[df_res['winner'] == 'A'])
|
| 1413 |
+
ties = len(df_res[df_res['winner'] == 'TIE'])
|
| 1414 |
+
win_rate = (wins_b / len(df_res)) * 100
|
| 1415 |
+
|
| 1416 |
+
col1, col2, col3, col4 = st.columns(4)
|
| 1417 |
+
col1.metric("Fine-tuned Win Rate", f"{win_rate:.1f}%")
|
| 1418 |
+
col2.metric("Fine-Tuned Wins", wins_b)
|
| 1419 |
+
col3.metric("Base Model Wins", wins_a)
|
| 1420 |
+
col4.metric("Avg Score Improvement", f"{df_res['score_b'].mean() - df_res['score_a'].mean():.2f}")
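The four metrics reduce to counting verdicts and differencing two means. A pandas-free sketch of the same arithmetic follows, assuming each result carries `winner`, `score_a`, and `score_b`; `summarize_verdicts` is a hypothetical helper, and the empty-input guard is an addition not present in the diff.

```python
from collections import Counter

def summarize_verdicts(results):
    """Compute win counts, fine-tuned win rate (%), and mean score gain."""
    total = len(results)
    if total == 0:
        return {"wins_a": 0, "wins_b": 0, "ties": 0,
                "win_rate": 0.0, "improvement": 0.0}
    counts = Counter(r["winner"] for r in results)
    mean_a = sum(r["score_a"] for r in results) / total
    mean_b = sum(r["score_b"] for r in results) / total
    return {
        "wins_a": counts["A"],
        "wins_b": counts["B"],
        "ties": counts["TIE"],
        "win_rate": counts["B"] / total * 100,  # ties count against the rate
        "improvement": mean_b - mean_a,
    }
```

As in the dashboard code, ties stay in the denominator, so the win rate is the share of all samples won outright by the fine-tuned model.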
|
| 1421 |
+
|
| 1422 |
+
# Charts
|
| 1423 |
+
c1, c2 = st.columns(2)
|
| 1424 |
+
with c1:
|
| 1425 |
+
fig = px.pie(values=[wins_b, wins_a, ties], names=['Fine-tuned', 'Base', 'Ties'],
|
| 1426 |
+
title="Win Distribution", color_discrete_sequence=['#6366f1', '#ef4444', '#94a3b8'])
|
| 1427 |
+
st.plotly_chart(fig, use_container_width=True)
|
| 1428 |
+
|
| 1429 |
+
with c2:
|
| 1430 |
+
avg_scores = pd.DataFrame({
|
| 1431 |
+
'Model': ['Base', 'Fine-tuned'],
|
| 1432 |
+
'Score': [df_res['score_a'].mean(), df_res['score_b'].mean()]
|
| 1433 |
+
})
|
| 1434 |
+
fig2 = px.bar(avg_scores, x='Model', y='Score', color='Model',
|
| 1435 |
+
title="Average Overall Score", color_discrete_map={'Base': '#ef4444', 'Fine-tuned': '#6366f1'})
|
| 1436 |
+
st.plotly_chart(fig2, use_container_width=True)
|
| 1437 |
+
|
| 1438 |
+
# Detailed Table
|
| 1439 |
+
st.markdown("### Detailed Verdicts")
|
| 1440 |
+
st.dataframe(df_res[['instruction', 'winner', 'score_a', 'score_b', 'reasoning']], use_container_width=True)
|
| 1441 |
+
|
| 1442 |
+
# Download
|
| 1443 |
+
st.download_button("⬇️ Download Report (JSON)",
|
| 1444 |
+
data=json.dumps(results, indent=2),
|
| 1445 |
+
file_name="evaluation_report.json",
|
| 1446 |
+
mime="application/json")
|
| 1447 |
|
| 1448 |
|
| 1449 |
# ============================================================================
|