tyang4 committed on
Commit 83088e1 · verified · 1 Parent(s): 1e2646d

Update src/streamlit_app.py

Files changed (1)
  1. src/streamlit_app.py +22 -35
src/streamlit_app.py CHANGED
@@ -379,48 +379,35 @@ def evaluate_dataset_with_gpt(subtask: str, df: pd.DataFrame, client=openai_clie
  sample_rows = df.head(3)[selected_cols].to_dict(orient="records") # take 3 example rows
 
  prompt = f"""
- You are a data-validation assistant. Your job is to determine whether the dataset below is useful for the research subtask.
 
- ===== TASK =====
- Subtask: "{{subtask}}"
 
- ===== DATASET PREVIEW =====
- Schema: {{schema}}
- Example Rows: {{example_rows}}
 
  ===== OUTPUT INSTRUCTIONS (follow strictly) =====
 
- Case A – Relevant:
- • Write exactly two sentences, each no more than 30 words.
- • Sentence 1: summarize what the dataset contains.
- • Sentence 2: explain why it helps answer the subtask.
- • Do not mention specific column names or list individual rows.
- • Do NOT generate any additional explanation or markdown formatting.
-
- Case B – Not Relevant:
- • Write one or two sentences, each no more than 30 words, describing **only what the dataset contains**.
- • Do **NOT** mention the subtask, usefulness, relevance, or missing information.
- • Do **NOT** use words like “irrelevant,” “not related,” “not useful,” “not sufficient,” etc.
- • After the sentence(s), output the exact header:
-
- **Additionally, here are some external resources you might find helpful:**
-
- • Then output 2–3 **real** resources in Markdown link format. Each must:
- - Have a **real source name** (e.g., “MatWeb”, not “Name of Source”)
- - Contain a **real, working URL** to a page or dataset related to the subtask
- - Be formatted exactly like: `- [Source Name](https://example.com)`
-
- • Do NOT use placeholder text like “Name of Source” or “URL”.
- • Do NOT generate any commentary after the list.
-
- Example for Case B:
 
- The dataset contains technical specifications of commercial vehicles, such as engine types and dimensions.
 
- **Additionally, here are some external resources you might find helpful:**
- - [Polymer Property Database](https://polymerdatabase.com/)
- - [MatWeb Materials Data Sheets](https://www.matweb.com/)
- - [NIST Thermophysical Properties of Polymers](https://www.nist.gov/srd/nist-standard-reference-database-147)
  """
 
  rsp = client.chat.completions.create(
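
A note on the placeholders in the removed lines: inside an f-string, doubled braces are an escape for literal braces, so `{{subtask}}` is not interpolated. Unless a later `.format(...)` call (not visible in this hunk) filled the placeholders in, the model would have received the literal text `{subtask}`. A minimal sketch of the difference, where the `subtask` value is made up for illustration:

```python
# Hypothetical value, used only to illustrate the behavior.
subtask = "thermal conductivity of polymers"

# Doubled braces inside an f-string are escapes: they produce literal
# braces, and the variable is never substituted.
old_line = f'Subtask: "{{subtask}}"'

# Single braces interpolate the variable's value.
new_line = f'Subtask: "{subtask}"'

print(old_line)
print(new_line)
```

The first line prints the placeholder text itself, the second prints the actual subtask.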
 
  sample_rows = df.head(3)[selected_cols].to_dict(orient="records") # take 3 example rows
 
  prompt = f"""
+ You are a data-validation assistant. Decide whether the dataset below is useful for the research subtask.
 
+ ===== TASK =====
+ Subtask: "{subtask}"
 
+ ===== DATASET PREVIEW =====
+ Schema (first {len(selected_cols)} columns):
+ {json.dumps(column_info, indent=10)}
+ Sample rows (10 max):
+ {json.dumps(sample_rows, indent=10)}
 
  ===== OUTPUT INSTRUCTIONS (follow strictly) =====
+ Case A – Relevant:
+ • Write exactly two sentences, each no more than 30 words.
+ • Summarize what the dataset contains and why it helps the subtask.
+ • Do not mention column names or list individual rows.
+
+ Case B – Not relevant:
+ • Write one or two sentences, each no more than 30 words, **describing only what the dataset contains**.
+ • Do **not** mention the subtask, relevance, suitability, limitations, or missing information (avoid phrases like “not related,” “does not focus,” “irrelevant,” etc.).
+ • After the sentences, output the header **Additionally, here are some external resources you might find helpful:** on a new line. Format your output in markdown as:
+ - [Name of Source](URL)
+ • Then list 2–3 bullet points, each on its own line, starting with “- ” followed immediately by a URL likely to contain the needed data.
+ • No additional commentary.
 
 
 
+ General rules:
+ Plain text only — no code fences. Markdown link syntax (`[text](url)`) is allowed.
 
 
  """
 
  rsp = client.chat.completions.create(
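
The added lines embed the schema and sample rows with `json.dumps`. The hunk does not show how `column_info` is built, and the label says "10 max" while the context line takes `df.head(3)`. The sketch below is one way to keep the label and the row limit in sync, not the author's implementation. The helper name `build_preview`, the `max_rows` parameter, the example data, and the use of `default=str` are all assumptions. `default=str` is there because `json.dumps` raises `TypeError` on values it cannot serialize, such as dates.

```python
import json
from datetime import date


def build_preview(column_info: dict, sample_rows: list, max_rows: int = 3) -> str:
    """Render the DATASET PREVIEW section of the prompt (hypothetical helper)."""
    rows = sample_rows[:max_rows]  # enforce the same limit the label advertises
    schema_json = json.dumps(column_info, indent=2)
    # default=str converts values json cannot encode (e.g. dates) to strings.
    rows_json = json.dumps(rows, indent=2, default=str)
    return (
        f"Schema (first {len(column_info)} columns):\n{schema_json}\n"
        f"Sample rows ({max_rows} max):\n{rows_json}"
    )


# Made-up example data standing in for column_info and sample_rows.
column_info = {"material": "object", "tested_on": "datetime64[ns]"}
sample_rows = [{"material": "PET", "tested_on": date(2024, 1, 5)}]

preview = build_preview(column_info, sample_rows)
print(preview)
```

Deriving the label from `max_rows` means the prompt cannot claim a different sample size than the code actually sends.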