tyang4 committed · Commit 2585768 · verified · 1 Parent(s): 51fd445

Update src/streamlit_app.py

Files changed (1)
  1. src/streamlit_app.py +22 -37
src/streamlit_app.py CHANGED
@@ -412,52 +412,37 @@ def evaluate_dataset_with_gpt(subtask: str, df: pd.DataFrame, client=openai_clie
  sample_rows = df.head(3)[selected_cols].to_dict(orient="records") # take 3 example rows
 
  prompt = f"""
- You are a data-validation assistant. Your job is to determine whether the dataset below is useful for the research subtask.
+ You are a data-validation assistant. Decide whether the dataset below is useful for the research subtask.
 
- ===== TASK =====
- Subtask: "{{subtask}}"
+ ===== TASK =====
+ Subtask: "{subtask}"
 
- ===== DATASET PREVIEW =====
- Schema: {{schema}}
- Example Rows: {{example_rows}}
+ ===== DATASET PREVIEW =====
+ Schema (first {len(selected_cols)} columns):
+ {json.dumps(column_info, indent=10)}
+ Sample rows (10 max):
+ {json.dumps(sample_rows, indent=10)}
 
  ===== OUTPUT INSTRUCTIONS (follow strictly) =====
+ Case A – Relevant:
+ • Write exactly two sentences, each no more than 30 words.
+ • Summarize what the dataset contains and why it helps the subtask.
+ • Do not mention column names or list individual rows.
+
+ Case B – Not relevant:
+ • Write one or two sentences, each no more than 30 words, **describing only what the dataset contains**.
+ • Do **not** mention the subtask, relevance, suitability, limitations, or missing information (avoid phrases like “not related,” “does not focus,” “irrelevant,” etc.).
+ • After the sentences, output the header **Additionally, here are some external resources you might find helpful:** on a new line. Format your output in markdown as:
+ - [Name of Source](URL)
+ • Then list 2–3 bullet points, each on its own line, starting with “- ” followed immediately by a URL likely to contain the needed data.
+ • No additional commentary.
 
- Case A – Relevant:
- • Write exactly two sentences, each no more than 30 words.
- • Sentence 1: summarize what the dataset contains.
- • Sentence 2: explain why it helps answer the subtask.
- • Do not mention specific column names or list individual rows.
- • Do NOT generate any additional explanation or markdown formatting.
-
- Case B – Not Relevant:
- • Write one or two sentences, each no more than 30 words, describing **only what the dataset contains**.
- • Do **NOT** mention the subtask, usefulness, relevance, or missing information.
- • Do **NOT** use words like “irrelevant,” “not related,” “not useful,” “not sufficient,” etc.
- • After the sentence(s), output the exact header:
-
- **Additionally, here are some external resources you might find helpful:**
-
- • Then output 2–3 **real** resources in Markdown link format. Each must:
- - Have a **real source name** (e.g., “MatWeb”, not “Name of Source”)
- - Contain a **real, working URL** to a page or dataset related to the subtask
- - Be formatted exactly like: `- [Source Name](https://example.com)`
-
- • Do NOT use placeholder text like “Name of Source” or “URL”.
- • Do NOT generate any commentary after the list.
-
- Example for Case B:
-
- The dataset contains technical specifications of commercial vehicles, such as engine types and dimensions.
 
- **Additionally, here are some external resources you might find helpful:**
- - [Polymer Property Database](https://polymerdatabase.com/)
- - [MatWeb Materials Data Sheets](https://www.matweb.com/)
- - [NIST Thermophysical Properties of Polymers](https://www.nist.gov/srd/nist-standard-reference-database-147)
 
+ General rules:
+ Plain text only — no code fences. Markdown link syntax (`[text](url)`) is allowed.
  """
 
-
  rsp = client.chat.completions.create(
  model="gpt-4o",
  messages=[{"role": "user", "content": prompt}],
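
The change from `{{subtask}}` to `{subtask}` in this hunk is more than cosmetic: inside a Python f-string, doubled braces are an escape for literal braces, so the old prompt would have sent the text `{subtask}` to the model rather than the actual subtask. The sketch below illustrates that difference and how `json.dumps` output can be embedded in the preview section. The sample values and the helper name `build_preview` are hypothetical and not part of the commit; the indent of 2 is chosen for readability only.

```python
import json

# Hypothetical stand-ins for values computed earlier in the real function.
subtask = "toy subtask"
selected_cols = ["name", "value"]
column_info = {"name": "object", "value": "float64"}
sample_rows = [{"name": "sample A", "value": 1.5}]


def build_preview(subtask, selected_cols, column_info, sample_rows):
    """Assemble the TASK and DATASET PREVIEW sections (illustrative helper)."""
    return f"""===== TASK =====
Subtask: "{subtask}"

===== DATASET PREVIEW =====
Schema (first {len(selected_cols)} columns):
{json.dumps(column_info, indent=2)}
Sample rows:
{json.dumps(sample_rows, indent=2)}
"""


# Doubled braces are emitted literally; single braces are interpolated.
escaped = f'Subtask: "{{subtask}}"'
interpolated = f'Subtask: "{subtask}"'

preview = build_preview(subtask, selected_cols, column_info, sample_rows)
print(escaped)
print(interpolated)
print(preview)
```

Printing the assembled prompt before sending it is a cheap way to catch placeholders that were left unfilled.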