.gitattributes CHANGED
@@ -1,2 +1,35 @@
- # Auto detect text files and perform LF normalization
- * text=auto
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.github/prompts/plan-llmCompare.prompt.md DELETED
@@ -1,69 +0,0 @@
- ## Plan: LLM Comparison Web App (Gradio)
-
- Build a Gradio Blocks app with two-column side-by-side LLM comparison. Left: user's custom model via OpenAI-compatible API endpoint. Right: selectable provider models (OpenAI, Anthropic, Gemini, Qwen, Yi) with default API keys from HF Spaces secrets. Users enter a nickname, prompt both models, then comment and grade (1-10) each response. All evaluations persist to SQLite. Admin can download all data as Excel (.xlsx). Deploy on HuggingFace Spaces.
-
- ---
-
- ### Phase 1: Project Setup
- 1. Create `requirements.txt` with: `gradio`, `openai`, `anthropic`, `google-generativeai`, `openpyxl`
- 2. Update `README.md` with project description and setup instructions
-
- ### Phase 2: Database Layer — `db.py`
- 3. Create SQLite helper with `init_db()` to create the `evaluations` table with columns: `id`, `timestamp`, `nickname`, `prompt`, `left_model_name`, `left_model_endpoint`, `left_response`, `left_comment`, `left_grade`, `right_model_name`, `right_provider`, `right_response`, `right_comment`, `right_grade`
- 4. Add `save_evaluation(...)` function to insert a row
- 5. Add `export_to_excel(filepath)` function using `openpyxl` to dump all rows to .xlsx
-
- ### Phase 3: LLM Provider Abstraction — `providers.py`
- 6. Define a model registry dict mapping display name → (provider, model_id, base_url, env_var_name):
-    - **OpenAI** (`gpt-4o`, `gpt-4o-mini`): `openai` SDK, default base
-    - **Anthropic** (`claude-sonnet-4-20250514`): `anthropic` SDK
-    - **Google Gemini** (`gemini-2.0-flash`): `google-generativeai` SDK
-    - **Qwen** (`qwen-plus`): `openai` SDK with DashScope base URL
-    - **Yi** (`yi-large`): `openai` SDK with 01.AI base URL
- 7. Implement `call_model(provider, model_name, prompt, api_key)` — dispatches to the correct SDK, falls back to env var key if user key is empty
- 8. Implement `call_custom_endpoint(base_url, model_name, prompt, api_key)` — uses `openai` SDK with user-supplied base_url for the left-side custom model
-
- ### Phase 4: Gradio UI — `app.py`
- 9. Build Gradio Blocks layout:
-    - **Top bar**: Nickname text input (required)
-    - **Prompt area**: Shared textbox + "Send to both" button
-    - **Two-column `gr.Row`**:
-      - **Left** ("Your Model"): API endpoint URL, model name, API key, response display, comment textbox, grade slider (1-10)
-      - **Right** ("Reference Model"): model dropdown (from registry), API key (optional, default provided), response display, comment textbox, grade slider (1-10)
-    - **Submit Evaluation** button → saves to SQLite
-    - **Download Report** button → exports .xlsx file
- 10. Wire "Send to both" → calls both models, displays responses
- 11. Wire "Submit Evaluation" → validates inputs, saves to DB, shows success notification
- 12. Wire "Download Report" → exports SQLite to temp .xlsx, returns as `gr.File`
-
- ### Phase 5: Security & Configuration
- 13. Default API keys from env vars (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, `DASHSCOPE_API_KEY`, `YI_API_KEY`), set as HF Spaces secrets. User-provided keys override per-session only — never stored. All keys processed server-side only.
- 14. Input sanitization: validate URL format for left endpoint, sanitize nickname (max 50 chars)
-
- ### Phase 6: Deployment
- 15. Create HuggingFace Space (Gradio SDK), push code
- 16. Set repository secrets for default API keys
- 17. End-to-end test on live Space
-
- ---
-
- **Relevant files**
- - `app.py` — Main Gradio Blocks UI, event wiring, layout (new)
- - `db.py` — SQLite init, save, export functions (new)
- - `providers.py` — Model registry, API call dispatch (new)
- - `requirements.txt` — Python dependencies (new)
- - `README.md` — Update with project info (existing)
-
- **Verification**
- 1. Launch locally with `python app.py`, verify two-column layout renders
- 2. Test left column with a local OpenAI-compatible endpoint (e.g. Ollama)
- 3. Test right column with each provider using default keys
- 4. Submit evaluation → verify row in SQLite
- 5. Download report → verify .xlsx has all columns populated
- 6. Test validation (missing nickname, missing grade → error)
- 7. Deploy to HF Spaces, set secrets, run full end-to-end
-
- **Further Considerations**
- 1. **SQLite persistence on HF Spaces**: Ephemeral storage resets on restart. Recommend enabling persistent storage and placing DB under `/data`. Alternative: periodic backup to HF Dataset.
- 2. **Rate limiting**: Consider adding per-nickname rate limiting to prevent abuse of default API keys.
- 3. **Streaming responses**: Initial version uses non-streaming calls; streaming can be added later for better UX.
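
Review note on the "Rate limiting" consideration in the plan above: the plan names per-nickname rate limiting but does not say how it would work. The following is a minimal sketch of one possible approach, a sliding-window counter kept in memory. The class name `NicknameRateLimiter`, its parameters, and the default limits are all hypothetical and do not appear anywhere in the repository; state is per-process only, so it would reset whenever the Space restarts.

```python
import time
from collections import defaultdict, deque


class NicknameRateLimiter:
    """Hypothetical sliding-window limiter: at most `max_calls` per `window_s` seconds per nickname."""

    def __init__(self, max_calls: int = 5, window_s: float = 60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._calls: dict[str, deque] = defaultdict(deque)

    def allow(self, nickname: str) -> bool:
        """Return True and record the call if the nickname is under its limit."""
        now = self._clock()
        calls = self._calls[nickname]
        # Drop timestamps that have fallen out of the window.
        while calls and now - calls[0] >= self.window_s:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True
```

A handler such as `send_to_both` could call `allow(nickname)` before dispatching to the providers and raise an error when it returns `False`. Nicknames are self-chosen, so this only discourages casual overuse of the default keys.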
 
.gitignore DELETED
@@ -1,5 +0,0 @@
- __pycache__/
- *.pyc
- evaluations.db
- *.xlsx
- .DS_Store
 
README.md CHANGED
@@ -1,82 +1,13 @@
  ---
- title: LLM Compare
- emoji: 🔍
- colorFrom: blue
+ title: Llm Compare
+ emoji: 👀
+ colorFrom: green
  colorTo: purple
  sdk: gradio
- sdk_version: "6.9.0"
+ sdk_version: 6.9.0
  app_file: app.py
  pinned: false
+ short_description: compares anti DV agent with other public agents
  ---
  
- # LLM Compare
-
- A Gradio web app for side-by-side LLM comparison. Compare your Dify application against reference models from OpenAI, Anthropic, Google Gemini, Qwen, and Yi.
-
- ## Features
-
- - **Two-column layout**: Your Dify app on the left, a selectable reference model on the right
- - **Multiple providers**: OpenAI (GPT-4o), Anthropic (Claude), Google Gemini, Qwen, Yi
- - **Overridable defaults**: Base URL and Model ID auto-fill from env vars but can be edited per-session
- - **Evaluation workflow**: Comment and grade (1–10) each model's response
- - **Nickname tracking**: All evaluations tagged with user nickname
- - **Excel export**: Download all evaluation data as `.xlsx`
-
- ## Setup
-
- ```bash
- pip install -r requirements.txt
- python app.py
- ```
-
- ## Environment Variables
-
- Set these as **Hugging Face Spaces secrets** (Settings → Repository secrets) to provide defaults.
- Users can override Base URL / Model ID in the UI at runtime.
-
- ### API Keys (required for each provider you use)
-
- | Variable | Provider |
- |---|---|
- | `OPENAI_API_KEY` | OpenAI |
- | `ANTHROPIC_API_KEY` | Anthropic |
- | `GOOGLE_API_KEY` | Google Gemini |
- | `DASHSCOPE_API_KEY` | Qwen (DashScope / Alibaba) |
- | `YI_API_KEY` | Yi (01.AI) |
-
- ### Base URL overrides (optional)
-
- Override the default API endpoint for each provider. Useful for proxies or custom deployments.
-
- | Variable | Default |
- |---|---|
- | `OPENAI_BASE_URL` | *(uses OpenAI SDK default)* |
- | `ANTHROPIC_BASE_URL` | *(uses Anthropic SDK default)* |
- | `GOOGLE_BASE_URL` | *(uses Google GenAI SDK default)* |
- | `DASHSCOPE_BASE_URL` | `https://dashscope.aliyuncs.com/compatible-mode/v1` |
- | `YI_BASE_URL` | `https://api.01.ai/v1` |
-
- ### Model ID overrides (optional)
-
- Override the default model ID. Useful for switching to newer model versions without code changes.
-
- | Variable | Default |
- |---|---|
- | `OPENAI_MODEL_ID` | `gpt-4o` |
- | `OPENAI_MINI_MODEL_ID` | `gpt-4o-mini` |
- | `ANTHROPIC_MODEL_ID` | `claude-sonnet-4-20250514` |
- | `GOOGLE_MODEL_ID` | `gemini-2.0-flash` |
- | `DASHSCOPE_MODEL_ID` | `qwen-plus` |
- | `YI_MODEL_ID` | `yi-large` |
-
- ## How it works
-
- 1. Select a reference model from the dropdown — **Base URL** and **Model ID** auto-fill from env vars (or registry defaults)
- 2. Edit Base URL / Model ID if needed (changes apply to current session only)
- 3. Enter your prompt and click **Send to Both**
- 4. Grade and comment on each response, then **Submit Evaluation**
-
- ## Deployment
-
- Deploy on HuggingFace Spaces with Gradio SDK. Set the API keys and optional overrides as repository secrets in Settings.
-
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
__pycache__/app.cpython-311.pyc DELETED
Binary file (9.94 kB)
 
__pycache__/db.cpython-311.pyc DELETED
Binary file (4.43 kB)
 
__pycache__/providers.cpython-311.pyc DELETED
Binary file (5.29 kB)
 
app.py DELETED
@@ -1,297 +0,0 @@
- import re
- import tempfile
- import gradio as gr
-
- from db import init_db, save_evaluation, export_to_excel
- from providers import (
-     MODEL_NAMES,
-     call_model,
-     call_custom_endpoint,
-     MODEL_REGISTRY,
-     get_model_defaults,
- )
-
- # ---------------------------------------------------------------------------
- # Initialise database on import
- # ---------------------------------------------------------------------------
- init_db()
-
- # ---------------------------------------------------------------------------
- # Helpers
- # ---------------------------------------------------------------------------
- URL_RE = re.compile(r"^https?://\S+$")
-
-
- def _sanitize_nickname(nick: str) -> str:
-     return nick.strip()[:50]
-
-
- def _validate_url(url: str) -> bool:
-     return bool(URL_RE.match(url.strip()))
-
-
- def on_model_select(display_name: str):
-     """When user picks a model from dropdown, populate base_url and model_id."""
-     base_url, model_id = get_model_defaults(display_name)
-     return base_url, model_id
-
-
- # ---------------------------------------------------------------------------
- # Event handlers
- # ---------------------------------------------------------------------------
-
- def send_to_both(
-     prompt: str,
-     left_url: str,
-     left_model: str,
-     left_key: str,
-     right_name: str,
-     right_base_url: str,
-     right_model_id: str,
-     right_key: str,
- ):
-     """Call both models and return their responses."""
-     if not prompt or not prompt.strip():
-         raise gr.Error("Please enter a prompt.")
-
-     # Left — Dify endpoint
-     left_response = ""
-     left_err = ""
-     if left_url and left_url.strip():
-         if not _validate_url(left_url):
-             left_err = "⚠️ Invalid URL format. Use http:// or https://."
-         else:
-             try:
-                 left_response = call_custom_endpoint(
-                     left_url.strip(), left_model.strip() or "default", prompt, left_key
-                 )
-             except Exception as e:
-                 left_err = f"⚠️ Left model error: {e}"
-
-     # Right — registry model (with optional user overrides)
-     right_response = ""
-     right_err = ""
-     try:
-         right_response = call_model(
-             right_name, prompt, right_key, right_base_url, right_model_id
-         )
-     except Exception as e:
-         right_err = f"⚠️ Right model error: {e}"
-
-     return (
-         left_response if not left_err else left_err,
-         right_response if not right_err else right_err,
-     )
-
-
- def submit_evaluation(
-     nickname: str,
-     prompt: str,
-     left_url: str,
-     left_model: str,
-     left_response: str,
-     left_comment: str,
-     left_grade: int,
-     right_name: str,
-     right_model_id: str,
-     right_response: str,
-     right_comment: str,
-     right_grade: int,
- ):
-     """Validate and persist an evaluation."""
-     nickname = _sanitize_nickname(nickname)
-     if not nickname:
-         raise gr.Error("Nickname is required.")
-     if not prompt or not prompt.strip():
-         raise gr.Error("Prompt is empty — send a prompt first.")
-     if not left_response.strip() and not right_response.strip():
-         raise gr.Error("No responses to evaluate — send a prompt first.")
-     if left_grade < 1 or left_grade > 10:
-         raise gr.Error("Left grade must be between 1 and 10.")
-     if right_grade < 1 or right_grade > 10:
-         raise gr.Error("Right grade must be between 1 and 10.")
-
-     entry = MODEL_REGISTRY.get(right_name, {})
-     right_provider = entry.get("provider", "unknown")
-
-     save_evaluation(
-         nickname=nickname,
-         prompt=prompt,
-         left_model_name=left_model.strip() or "custom",
-         left_model_endpoint=left_url.strip(),
-         left_response=left_response,
-         left_comment=left_comment,
-         left_grade=int(left_grade),
-         right_model_name=right_model_id.strip() or right_name,
-         right_provider=right_provider,
-         right_response=right_response,
-         right_comment=right_comment,
-         right_grade=int(right_grade),
-     )
-     gr.Info("✅ Evaluation saved!")
-
-
- def download_report():
-     """Export all evaluations to a temp .xlsx and return as a downloadable file."""
-     tmp = tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False)
-     export_to_excel(tmp.name)
-     return tmp.name
-
-
- # ---------------------------------------------------------------------------
- # Gradio Blocks UI
- # ---------------------------------------------------------------------------
-
- # Pre-compute initial defaults for first model
- _init_base_url, _init_model_id = get_model_defaults(MODEL_NAMES[0])
-
- with gr.Blocks(title="LLM Compare") as demo:
-     gr.Markdown("# 🔍 LLM Compare\nSide-by-side comparison: your Dify app vs reference models.")
-
-     # ---- Top bar: nickname ---------------------------------------------------
-     with gr.Row():
-         nickname = gr.Textbox(
-             label="Your Nickname",
-             placeholder="Enter a nickname (required)",
-             scale=2,
-         )
-
-     # ---- Prompt area ---------------------------------------------------------
-     with gr.Row():
-         prompt = gr.Textbox(
-             label="Prompt",
-             placeholder="Type your prompt here…",
-             lines=4,
-             scale=4,
-         )
-         send_btn = gr.Button("🚀 Send to Both", variant="primary", scale=1)
-
-     # ---- Two-column layout ---------------------------------------------------
-     with gr.Row(equal_height=True):
-         # ---- LEFT: Dify model ------------------------------------------------
-         with gr.Column():
-             gr.Markdown("### 🧪 Your Model (Dify Endpoint)")
-             left_url = gr.Textbox(
-                 label="Dify API Base URL",
-                 placeholder="https://api.dify.ai/v1",
-             )
-             left_model = gr.Textbox(
-                 label="App Name (for display only)",
-                 placeholder="e.g. my-dify-app",
-             )
-             left_key = gr.Textbox(
-                 label="Dify Secret Key",
-                 placeholder="app-xxxxxxxxxxxx",
-                 type="password",
-             )
-             left_response = gr.Textbox(
-                 label="Response",
-                 lines=12,
-                 interactive=False,
-             )
-             left_comment = gr.Textbox(
-                 label="Comment",
-                 placeholder="Your thoughts on this response…",
-                 lines=2,
-             )
-             left_grade = gr.Slider(
-                 minimum=1,
-                 maximum=10,
-                 step=1,
-                 value=5,
-                 label="Grade (1–10)",
-             )
-
-         # ---- RIGHT: reference model ------------------------------------------
-         with gr.Column():
-             gr.Markdown("### 📚 Reference Model")
-             right_name = gr.Dropdown(
-                 choices=MODEL_NAMES,
-                 value=MODEL_NAMES[0],
-                 label="Select Model",
-             )
-             right_base_url = gr.Textbox(
-                 label="Base URL (auto-filled, editable)",
-                 value=_init_base_url,
-                 placeholder="e.g. https://api.openai.com/v1",
-             )
-             right_model_id = gr.Textbox(
-                 label="Model ID (auto-filled, editable)",
-                 value=_init_model_id,
-                 placeholder="e.g. gpt-4o",
-             )
-             right_key = gr.Textbox(
-                 label="API Key (optional — uses env default)",
-                 placeholder="Leave blank to use default key",
-                 type="password",
-             )
-             right_response = gr.Textbox(
-                 label="Response",
-                 lines=12,
-                 interactive=False,
-             )
-             right_comment = gr.Textbox(
-                 label="Comment",
-                 placeholder="Your thoughts on this response…",
-                 lines=2,
-             )
-             right_grade = gr.Slider(
-                 minimum=1,
-                 maximum=10,
-                 step=1,
-                 value=5,
-                 label="Grade (1–10)",
-             )
-
-     # ---- Action buttons ------------------------------------------------------
-     with gr.Row():
-         submit_btn = gr.Button("💾 Submit Evaluation", variant="primary")
-         download_btn = gr.Button("📥 Download Report (.xlsx)")
-     report_file = gr.File(label="Report", visible=False)
-
-     # ---- Wiring --------------------------------------------------------------
-
-     # Auto-fill base_url and model_id when dropdown changes
-     right_name.change(
-         fn=on_model_select,
-         inputs=[right_name],
-         outputs=[right_base_url, right_model_id],
-     )
-
-     send_btn.click(
-         fn=send_to_both,
-         inputs=[
-             prompt, left_url, left_model, left_key,
-             right_name, right_base_url, right_model_id, right_key,
-         ],
-         outputs=[left_response, right_response],
-     )
-
-     submit_btn.click(
-         fn=submit_evaluation,
-         inputs=[
-             nickname,
-             prompt,
-             left_url,
-             left_model,
-             left_response,
-             left_comment,
-             left_grade,
-             right_name,
-             right_model_id,
-             right_response,
-             right_comment,
-             right_grade,
-         ],
-         outputs=[],
-     )
-
-     download_btn.click(
-         fn=download_report,
-         inputs=[],
-         outputs=[report_file],
-     ).then(lambda: gr.update(visible=True), outputs=[report_file])
-
-
- if __name__ == "__main__":
-     demo.launch(theme=gr.themes.Soft())
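
Review note on `_validate_url` in the removed `app.py`: the regex `^https?://\S+$` accepts any non-space text after the scheme, so a string such as `https:///chat` would pass. A minimal sketch of a stricter check, assuming the standard library's `urllib.parse` is acceptable here, is shown below. The function name `is_http_url` is hypothetical and not part of the removed code.

```python
from urllib.parse import urlsplit


def is_http_url(url: str) -> bool:
    """Hypothetical stricter check: require an http(s) scheme and a non-empty host."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs (e.g. a broken IPv6 literal).
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```

This still does not confirm that the endpoint is reachable or safe to call; it only rejects inputs that lack a host.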
 
db.py DELETED
@@ -1,100 +0,0 @@
- import sqlite3
- import os
- from datetime import datetime
-
- from openpyxl import Workbook
-
- DB_DIR = os.environ.get("DATA_DIR", ".")
- DB_PATH = os.path.join(DB_DIR, "evaluations.db")
-
-
- def _get_conn() -> sqlite3.Connection:
-     conn = sqlite3.connect(DB_PATH)
-     conn.execute("PRAGMA journal_mode=WAL")
-     return conn
-
-
- def init_db() -> None:
-     conn = _get_conn()
-     conn.execute(
-         """
-         CREATE TABLE IF NOT EXISTS evaluations (
-             id INTEGER PRIMARY KEY AUTOINCREMENT,
-             timestamp TEXT NOT NULL,
-             nickname TEXT NOT NULL,
-             prompt TEXT NOT NULL,
-             left_model_name TEXT NOT NULL,
-             left_model_endpoint TEXT NOT NULL,
-             left_response TEXT NOT NULL,
-             left_comment TEXT NOT NULL DEFAULT '',
-             left_grade INTEGER NOT NULL,
-             right_model_name TEXT NOT NULL,
-             right_provider TEXT NOT NULL,
-             right_response TEXT NOT NULL,
-             right_comment TEXT NOT NULL DEFAULT '',
-             right_grade INTEGER NOT NULL
-         )
-         """
-     )
-     conn.commit()
-     conn.close()
-
-
- def save_evaluation(
-     nickname: str,
-     prompt: str,
-     left_model_name: str,
-     left_model_endpoint: str,
-     left_response: str,
-     left_comment: str,
-     left_grade: int,
-     right_model_name: str,
-     right_provider: str,
-     right_response: str,
-     right_comment: str,
-     right_grade: int,
- ) -> None:
-     conn = _get_conn()
-     conn.execute(
-         """
-         INSERT INTO evaluations (
-             timestamp, nickname, prompt,
-             left_model_name, left_model_endpoint, left_response, left_comment, left_grade,
-             right_model_name, right_provider, right_response, right_comment, right_grade
-         ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
-         """,
-         (
-             datetime.utcnow().isoformat(),
-             nickname,
-             prompt,
-             left_model_name,
-             left_model_endpoint,
-             left_response,
-             left_comment,
-             left_grade,
-             right_model_name,
-             right_provider,
-             right_response,
-             right_comment,
-             right_grade,
-         ),
-     )
-     conn.commit()
-     conn.close()
-
-
- def export_to_excel(filepath: str) -> str:
-     conn = _get_conn()
-     cursor = conn.execute("SELECT * FROM evaluations ORDER BY id")
-     columns = [desc[0] for desc in cursor.description]
-     rows = cursor.fetchall()
-     conn.close()
-
-     wb = Workbook()
-     ws = wb.active
-     ws.title = "Evaluations"
-     ws.append(columns)
-     for row in rows:
-         ws.append(list(row))
-     wb.save(filepath)
-     return filepath
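
Review note on the removed `db.py`: `save_evaluation` stamps rows with `datetime.utcnow()`, which returns a naive datetime and is deprecated as of Python 3.12, and it leaves the connection open if the insert raises. The sketch below shows one possible alternative using a timezone-aware timestamp and context managers. The reduced three-column table and the helper name `save_row` are simplifications made up for illustration, not the schema or API of the removed file.

```python
import sqlite3
from contextlib import closing
from datetime import datetime, timezone


def save_row(conn: sqlite3.Connection, nickname: str, grade: int) -> None:
    """Hypothetical insert helper using a UTC-aware ISO 8601 timestamp."""
    # `with conn` commits on success and rolls back if the insert raises.
    with conn:
        conn.execute(
            "INSERT INTO evaluations (timestamp, nickname, grade) VALUES (?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), nickname, grade),
        )


# Usage against an in-memory database; closing() ensures the connection is released.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE evaluations ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "timestamp TEXT NOT NULL, nickname TEXT NOT NULL, grade INTEGER NOT NULL)"
    )
    save_row(conn, "alice", 7)
    saved = conn.execute("SELECT timestamp, nickname, grade FROM evaluations").fetchall()
```

The stored timestamp then carries an explicit `+00:00` offset, which avoids ambiguity when the exported spreadsheet is read in another time zone.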
 
evaluations.db DELETED
Binary file (12.3 kB)
 
llm_compare DELETED
@@ -1 +0,0 @@
- Subproject commit 75be778201f3a72ce3af88996ea7e433263a5f41
 
 
providers.py DELETED
@@ -1,184 +0,0 @@
- import os
- import requests
- from openai import OpenAI
- import anthropic
- from google import genai
-
- # ---------------------------------------------------------------------------
- # Model Registry
- # Each entry: display_name -> {provider, model_id, base_url (None = default), env_var}
- # ---------------------------------------------------------------------------
- MODEL_REGISTRY: dict[str, dict] = {
-     "GPT-4o (OpenAI)": {
-         "provider": "openai",
-         "model_id": "gpt-4o",
-         "base_url": None,
-         "env_var": "OPENAI_API_KEY",
-         "env_base_url": "OPENAI_BASE_URL",
-         "env_model_id": "OPENAI_MODEL_ID",
-     },
-     "GPT-4o-mini (OpenAI)": {
-         "provider": "openai",
-         "model_id": "gpt-4o-mini",
-         "base_url": None,
-         "env_var": "OPENAI_API_KEY",
-         "env_base_url": "OPENAI_BASE_URL",
-         "env_model_id": "OPENAI_MINI_MODEL_ID",
-     },
-     "Claude Sonnet 4 (Anthropic)": {
-         "provider": "anthropic",
-         "model_id": "claude-sonnet-4-6",
-         "base_url": None,
-         "env_var": "ANTHROPIC_API_KEY",
-         "env_base_url": "ANTHROPIC_BASE_URL",
-         "env_model_id": "ANTHROPIC_MODEL_ID",
-     },
-     "Gemini 2.0 Flash (Google)": {
-         "provider": "gemini",
-         "model_id": "gemini-2.0-flash",
-         "base_url": None,
-         "env_var": "GOOGLE_API_KEY",
-         "env_base_url": "GOOGLE_BASE_URL",
-         "env_model_id": "GOOGLE_MODEL_ID",
-     },
-     "Qwen-Plus (Alibaba)": {
-         "provider": "openai_compat",
-         "model_id": "qwen-plus",
-         "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
-         "env_var": "DASHSCOPE_API_KEY",
-         "env_base_url": "DASHSCOPE_BASE_URL",
-         "env_model_id": "DASHSCOPE_MODEL_ID",
-     },
-     "Yi-Large (01.AI)": {
-         "provider": "openai_compat",
-         "model_id": "yi-large",
-         "base_url": "https://api.01.ai/v1",
-         "env_var": "YI_API_KEY",
-         "env_base_url": "YI_BASE_URL",
-         "env_model_id": "YI_MODEL_ID",
-     },
- }
-
- MODEL_NAMES = list(MODEL_REGISTRY.keys())
-
-
- def get_model_defaults(display_name: str) -> tuple[str, str]:
-     """Return (base_url, model_id) for a registry model, considering env overrides.
-
-     Priority: env var > registry hardcoded value.
-     """
-     entry = MODEL_REGISTRY.get(display_name, {})
-     base_url = os.environ.get(entry.get("env_base_url", ""), "") or entry.get("base_url") or ""
-     model_id = os.environ.get(entry.get("env_model_id", ""), "") or entry.get("model_id", "")
-     return base_url, model_id
-
-
- def _resolve_key(env_var: str, user_key: str | None) -> str:
-     """Return user-provided key if non-empty, else fall back to env var."""
-     if user_key and user_key.strip():
-         return user_key.strip()
-     key = os.environ.get(env_var, "")
-     if not key:
-         raise ValueError(
-             f"No API key provided and environment variable {env_var} is not set."
-         )
-     return key
-
-
- # ---------------------------------------------------------------------------
- # Provider dispatch
- # ---------------------------------------------------------------------------
-
- def _call_openai(model_id: str, prompt: str, api_key: str, base_url: str | None) -> str:
-     client = OpenAI(api_key=api_key, base_url=base_url)
-     resp = client.chat.completions.create(
-         model=model_id,
-         messages=[{"role": "user", "content": prompt}],
-     )
-     return resp.choices[0].message.content
-
-
- def _call_anthropic(model_id: str, prompt: str, api_key: str) -> str:
-     client = anthropic.Anthropic(api_key=api_key)
-     resp = client.messages.create(
-         model=model_id,
-         max_tokens=4096,
-         messages=[{"role": "user", "content": prompt}],
-     )
-     return resp.content[0].text
-
-
- def _call_gemini(model_id: str, prompt: str, api_key: str) -> str:
-     client = genai.Client(api_key=api_key)
-     resp = client.models.generate_content(model=model_id, contents=prompt)
-     return resp.text
-
-
- def call_model(
-     display_name: str,
-     prompt: str,
-     user_key: str | None = None,
-     user_base_url: str | None = None,
-     user_model_id: str | None = None,
- ) -> str:
-     """Call a reference model from the registry.
-
-     User-supplied base_url / model_id override env-var defaults, which in turn
-     override the hardcoded registry values.
-     """
-     entry = MODEL_REGISTRY.get(display_name)
-     if entry is None:
-         raise ValueError(f"Unknown model: {display_name}")
-
-     api_key = _resolve_key(entry["env_var"], user_key)
-     provider = entry["provider"]
-
-     # Resolve: user input > env var > registry default
-     default_base_url, default_model_id = get_model_defaults(display_name)
-     model_id = (user_model_id.strip() if user_model_id and user_model_id.strip() else "") or default_model_id
-     base_url = (user_base_url.strip() if user_base_url and user_base_url.strip() else "") or default_base_url or None
-
-     if provider in ("openai", "openai_compat"):
-         return _call_openai(model_id, prompt, api_key, base_url)
-     elif provider == "anthropic":
-         return _call_anthropic(model_id, prompt, api_key)
-     elif provider == "gemini":
-         return _call_gemini(model_id, prompt, api_key)
-     else:
-         raise ValueError(f"Unknown provider: {provider}")
-
-
- def call_custom_endpoint(
-     base_url: str, model_name: str, prompt: str, api_key: str
- ) -> str:
-     """Call a user-supplied Dify application endpoint (left column).
-
-     Dify API docs: https://docs.dify.ai/en/guides/application-publishing/developing-with-apis
-
-     base_url should be the Dify API base, e.g. https://api.dify.ai/v1
-     The endpoint called is {base_url}/chat-messages (for Chat apps).
-     """
-     if not base_url or not base_url.strip():
-         raise ValueError("API endpoint URL is required for your Dify model.")
-     if not api_key or not api_key.strip():
-         raise ValueError("API Key (Secret Key) is required for Dify.")
-
-     url = base_url.strip().rstrip("/") + "/chat-messages"
-     headers = {
-         "Authorization": f"Bearer {api_key.strip()}",
-         "Content-Type": "application/json",
-     }
-     payload = {
-         "inputs": {},
-         "query": prompt,
-         "response_mode": "blocking",
-         "user": "llm-compare-user",
-     }
-
-     resp = requests.post(url, json=payload, headers=headers, timeout=120)
-     resp.raise_for_status()
-     data = resp.json()
-     answer = data.get("answer", "")
-     if not answer:
-         raise ValueError(f"Dify returned no answer. Full response: {data}")
-     return answer
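
Review note on the removed `providers.py`: the precedence "user input > env var > registry default" is spelled out inline in `call_model` and again, partially, in `get_model_defaults`. A minimal sketch of factoring that rule into one helper follows. The name `resolve_setting` and its `env` parameter are hypothetical; the removed code reads `os.environ` directly rather than taking a mapping.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    user_value: Optional[str],
    env_name: str,
    default: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical helper: non-empty user value, else env var, else the default."""
    if user_value and user_value.strip():
        return user_value.strip()
    # An unset or empty environment variable falls through to the default.
    return env.get(env_name, "") or default


# Usage with explicit mappings so the result does not depend on the real environment.
from_user = resolve_setting(" gpt-x ", "OPENAI_MODEL_ID", "gpt-4o", env={})
from_env = resolve_setting("", "OPENAI_MODEL_ID", "gpt-4o", env={"OPENAI_MODEL_ID": "gpt-4o-mini"})
from_default = resolve_setting(None, "OPENAI_MODEL_ID", "gpt-4o", env={})
```

Passing the environment as a mapping is a design choice made here only so the precedence can be exercised without mutating process state.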
 
requirements.txt DELETED
@@ -1,5 +0,0 @@
- gradio
- openai
- anthropic
- google-genai
- openpyxl