dmpantiu committed on
Commit
09f0030
·
verified ·
1 Parent(s): 4fbe36e

Upload folder using huggingface_hub

Browse files
Files changed (3) hide show
  1. RESPONSES.TXT +0 -0
  2. scripts/qa_image_review.py +38 -16
  3. src/eurus/config.py +1 -1
RESPONSES.TXT ADDED
File without changes
scripts/qa_image_review.py CHANGED
@@ -105,21 +105,43 @@ QA_QUERIES = {
105
 
106
 
107
  REVIEW_SYSTEM_PROMPT = """\
108
- You are a senior scientific visualization reviewer for a climate/weather data agent.
109
  You will receive one or more PNG plots generated by an AI agent and the TASK that the agent was asked to complete.
110
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
111
  Review each plot against the task and provide a structured assessment:
112
 
113
- 1. **Task Compliance** (1-10): Does the plot address what was asked?
114
- 2. **Scientific Accuracy** (1-10): Are axes labeled, units correct, colorbar present, projections reasonable?
115
- 3. **Visual Quality** (1-10): Is the plot publication-quality? Good resolution, readable labels, professional aesthetics?
116
- 4. **Spatial/Map Quality** (1-10): If it's a map — does it have coastlines, proper projection, geographic labels? If not a map, rate the chart type appropriateness.
117
- 5. **Overall Score** (1-10): Weighted average considering all factors.
 
 
 
 
118
 
119
  Also provide:
120
- - **Summary**: 1-2 sentence summary of what the plot shows.
121
- - **Strengths**: Key things done well.
122
- - **Issues**: Any problems, missing elements, or improvements needed.
 
 
123
 
124
  Respond ONLY in valid JSON with this exact structure:
125
  {
@@ -130,7 +152,7 @@ Respond ONLY in valid JSON with this exact structure:
130
  "overall_score": <int>,
131
  "summary": "<string>",
132
  "strengths": ["<string>", ...],
133
- "issues": ["<string>", ...]
134
  }
135
  """
136
 
@@ -162,7 +184,7 @@ def review_single_question(client: genai.Client, qid: int, task: str,
162
  img_bytes = f.read()
163
  parts.append(types.Part.from_bytes(data=img_bytes, mime_type="image/png"))
164
 
165
- for attempt in range(4):
166
  try:
167
  response = client.models.generate_content(
168
  model=model,
@@ -170,7 +192,7 @@ def review_single_question(client: genai.Client, qid: int, task: str,
170
  config=types.GenerateContentConfig(
171
  system_instruction=REVIEW_SYSTEM_PROMPT,
172
  temperature=0.2,
173
- max_output_tokens=1000,
174
  ),
175
  )
176
  raw = response.text.strip()
@@ -194,12 +216,12 @@ def review_single_question(client: genai.Client, qid: int, task: str,
194
  except Exception as e:
195
  err_str = str(e)
196
  if "429" in err_str or "RESOURCE_EXHAUSTED" in err_str:
197
- wait = min(2 ** attempt * 5, 60)
198
- print(f"\n Rate limited, waiting {wait}s (attempt {attempt+1}/4)...", end="", flush=True)
199
  time.sleep(wait)
200
  else:
201
- if attempt < 3:
202
- time.sleep(2)
203
  continue
204
  return {"error": str(e)[:300]}
205
 
 
105
 
106
 
107
  REVIEW_SYSTEM_PROMPT = """\
108
+ You are a RUTHLESS, METICULOUS senior scientific visualization reviewer for a climate/weather data agent.
109
  You will receive one or more PNG plots generated by an AI agent and the TASK that the agent was asked to complete.
110
 
111
+ YOUR #1 JOB: For EVERY issue you find, describe it with EXACT SPECIFICITY.
112
+ Do NOT say "labels are unclear" — say EXACTLY which label, where it is, and what is wrong with it.
113
+ Do NOT say "colorbar could be better" — say EXACTLY what the colorbar shows, what it should show, and what the specific problem is.
114
+ Do NOT give vague feedback. Every single issue MUST pinpoint the EXACT location and EXACT problem in the figure.
115
+
116
+ CRITICAL: Be EXTREMELY SPECIFIC about problems. Point to EXACT elements:
117
+ - "The y-axis label says 'Value' but should say 'Temperature (°C)'"
118
+ - "The colorbar range is 270-310K but should be converted to °C for readability"
119
+ - "Coastlines are missing from the spatial map — there is no land/ocean boundary visible"
120
+ - "The title says 'January 2024' but the x-axis data only covers December 2023"
121
+ - "The legend overlaps with the data in the upper-right quadrant, obscuring the January peak"
122
+ - "Wind vectors are plotted but have no reference arrow showing the scale"
123
+ - "The projection is PlateCarree but should be a polar stereographic for Arctic data above 70°N"
124
+
125
+ For EACH problem: describe WHERE in the figure it is, WHAT exactly is wrong, and WHAT it should be instead.
126
+
127
  Review each plot against the task and provide a structured assessment:
128
 
129
+ 1. **Task Compliance** (1-10): Does the plot address EXACTLY what was asked? Check every single requirement in the task description. If the task says "two-panel" and there's only one panel, that is a major failure. If the task says "vs" comparison and only one dataset is shown, that is a failure. Be strict.
130
+
131
+ 2. **Scientific Accuracy** (1-10): Are ALL axes labeled with correct units? Is the colorbar present with proper units and range? Are values physically reasonable (e.g., SST not showing 0K)? Are projections appropriate for the region? Check EVERY axis, EVERY label, EVERY unit.
132
+
133
+ 3. **Visual Quality** (1-10): Is it publication-quality? Check: font sizes readable? Labels not overlapping data? Grid lines appropriate? Color scheme suitable (e.g., diverging for anomalies, sequential for absolute values)? Title descriptive and correct?
134
+
135
+ 4. **Spatial/Map Quality** (1-10): For maps — are coastlines drawn? Is the projection correct for the region? Are lat/lon gridlines present? Are geographic features identifiable? For non-maps — is the chart type appropriate?
136
+
137
+ 5. **Overall Score** (1-10): Weighted average. Be HARSH — a score of 8+ means near-perfect.
138
 
139
  Also provide:
140
+ - **Summary**: 1-2 sentence factual summary of what the plot actually shows.
141
+ - **Strengths**: Specific things done well. Be precise — not "good colors" but "diverging RdBu colormap correctly centered at zero for anomaly data".
142
+ - **Issues**: LIST EVERY SINGLE PROBLEM. Each issue MUST describe the EXACT element, its EXACT location in the figure, WHAT is wrong, and WHAT it should be. DO NOT BE VAGUE. This is the MOST IMPORTANT part of your review. Be exhaustive. Miss nothing.
143
+
144
+ I REPEAT: The "issues" field is the MOST CRITICAL part. Every issue must be SPECIFIC and ACTIONABLE. Generic feedback like "could be improved" is UNACCEPTABLE. Say EXACTLY what needs to change and WHERE.
145
 
146
  Respond ONLY in valid JSON with this exact structure:
147
  {
 
152
  "overall_score": <int>,
153
  "summary": "<string>",
154
  "strengths": ["<string>", ...],
155
+ "issues": ["<string — MUST be specific and exact, describing WHERE and WHAT>", ...]
156
  }
157
  """
158
 
 
184
  img_bytes = f.read()
185
  parts.append(types.Part.from_bytes(data=img_bytes, mime_type="image/png"))
186
 
187
+ for attempt in range(6):
188
  try:
189
  response = client.models.generate_content(
190
  model=model,
 
192
  config=types.GenerateContentConfig(
193
  system_instruction=REVIEW_SYSTEM_PROMPT,
194
  temperature=0.2,
195
+ max_output_tokens=4096,
196
  ),
197
  )
198
  raw = response.text.strip()
 
216
  except Exception as e:
217
  err_str = str(e)
218
  if "429" in err_str or "RESOURCE_EXHAUSTED" in err_str:
219
+ wait = min(2 ** attempt * 15, 120)
220
+ print(f"\n Rate limited, waiting {wait}s (attempt {attempt+1}/6)...", end="", flush=True)
221
  time.sleep(wait)
222
  else:
223
+ if attempt < 5:
224
+ time.sleep(3)
225
  continue
226
  return {"error": str(e)[:300]}
227
 
src/eurus/config.py CHANGED
@@ -506,7 +506,7 @@ class AgentConfig:
506
  # Data Settings
507
  data_source: str = "earthmover-public/era5-surface-aws"
508
  default_query_type: str = "temporal"
509
- max_download_size_gb: float = 5.0
510
 
511
  # Retrieval Settings
512
  max_retries: int = 5
 
506
  # Data Settings
507
  data_source: str = "earthmover-public/era5-surface-aws"
508
  default_query_type: str = "temporal"
509
+ max_download_size_gb: float = 15.0
510
 
511
  # Retrieval Settings
512
  max_retries: int = 5