Spaces: tyang4/Ecodata

tyang4 committed on Jul 1

Commit 1e2646d (verified)

1 Parent(s): a311a46

Update src/streamlit_app.py

Files changed (1)

src/streamlit_app.py +32 -24

@@ -379,40 +379,48 @@ def evaluate_dataset_with_gpt(subtask: str, df: pd.DataFrame, client=openai_clie
     sample_rows = df.head(3)[selected_cols].to_dict(orient="records")   # take 3 example rows
     prompt = f"""
-You are a data‑validation assistant. Decide whether the dataset below is useful for the research subtask.
-===== TASK =====
 Subtask: "{{subtask}}"
-===== DATASET PREVIEW =====
-Schema (first {{len(selected_cols)}} columns):
-{{json.dumps(column_info, indent=20)}}
-Sample rows (20 max):
-{{json.dumps(sample_rows, indent=20)}}
 ===== OUTPUT INSTRUCTIONS (follow strictly) =====
-First, begin your response with one of the following labels on its own line:
-Relevant
-or
-Irrelevant
-Then follow the appropriate instruction below based on your decision.
----
-If you choose "Relevant":
-• Write exactly two sentences, each no more than 30 words.
-• Summarize what the dataset contains and why it helps the subtask.
-• Do not mention column names or list individual rows.
----
-If you choose "Irrelevant":
-• Write one or two sentences, each no more than 30 words, **describing only what the dataset contains**.
-• After the sentences, output the header **Additionally, here are some external resources you might find helpful:** on a new line. Format your output in markdown as:
-- [Name of Source](URL)
-• Then list 2–3 bullet points, each on its own line, starting with “- ” followed immediately by a URL likely to contain the needed data.
-• No additional commentary.
 """
     rsp = client.chat.completions.create(

     sample_rows = df.head(3)[selected_cols].to_dict(orient="records")   # take 3 example rows
     prompt = f"""
+You are a data-validation assistant. Your job is to determine whether the dataset below is useful for the research subtask.
+===== TASK =====
 Subtask: "{{subtask}}"
+===== DATASET PREVIEW =====
+Schema: {{schema}}
+Example Rows: {{example_rows}}
 ===== OUTPUT INSTRUCTIONS (follow strictly) =====
+Case A – Relevant:
+• Write exactly two sentences, each no more than 30 words.
+• Sentence 1: summarize what the dataset contains.
+• Sentence 2: explain why it helps answer the subtask.
+• Do not mention specific column names or list individual rows.
+• Do NOT generate any additional explanation or markdown formatting.
+Case B – Not Relevant:
+• Write one or two sentences, each no more than 30 words, describing **only what the dataset contains**.
+• Do **NOT** mention the subtask, usefulness, relevance, or missing information.
+• Do **NOT** use words like “irrelevant,” “not related,” “not useful,” “not sufficient,” etc.
+• After the sentence(s), output the exact header:
+**Additionally, here are some external resources you might find helpful:**
+• Then output 2–3 **real** resources in Markdown link format. Each must:
+   - Have a **real source name** (e.g., “MatWeb”, not “Name of Source”)
+   - Contain a **real, working URL** to a page or dataset related to the subtask
+   - Be formatted exactly like: `- [Source Name](https://example.com)`
+• Do NOT use placeholder text like “Name of Source” or “URL”.
+• Do NOT generate any commentary after the list.
+Example for Case B:
+The dataset contains technical specifications of commercial vehicles, such as engine types and dimensions.
+**Additionally, here are some external resources you might find helpful:**
+- [Polymer Property Database](https://polymerdatabase.com/)
+- [MatWeb Materials Data Sheets](https://www.matweb.com/)
+- [NIST Thermophysical Properties of Polymers](https://www.nist.gov/srd/nist-standard-reference-database-147)
 """
     rsp = client.chat.completions.create(
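
A note on the placeholders in this hunk: the prompt is declared with `f"""`, yet `{{subtask}}`, `{{schema}}` and `{{example_rows}}` appear with doubled braces. This may simply be how the page rendered the diff. If the file really reads this way, though, Python treats `{{` and `}}` inside an f-string as escaped literal braces and substitutes nothing. The sketch below illustrates the difference. The helper `build_prompt`, the sample values, and the choice to serialise `column_info` and `sample_rows` with `json.dumps` are assumptions made for illustration, not the author's code.

```python
import json


def build_prompt(subtask: str, column_info: dict, sample_rows: list) -> str:
    """Hypothetical helper: interpolate values using single braces."""
    # Assumption: schema and example rows are serialised as JSON text.
    schema = json.dumps(column_info, indent=2)
    example_rows = json.dumps(sample_rows, indent=2)
    return f"""
You are a data-validation assistant.
===== TASK =====
Subtask: "{subtask}"
===== DATASET PREVIEW =====
Schema: {schema}
Example Rows: {example_rows}
"""


# Made-up values, used only to demonstrate the behaviour.
subtask = "polymer tensile strength"

# Doubled braces in an f-string are escapes: the name is NOT substituted.
escaped = f'Subtask: "{{subtask}}"'
# Single braces interpolate the variable's value.
interpolated = f'Subtask: "{subtask}"'

print(escaped)       # Subtask: "{subtask}"
print(interpolated)  # Subtask: "polymer tensile strength"

prompt = build_prompt(
    subtask,
    {"engine_type": "object"},
    [{"engine_type": "diesel"}],
)
print(prompt)
```

If the doubled braces are intentional, for example because the string is later passed through `str.format`, the `f` prefix would be redundant and the names would have to be supplied at that later call instead.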