Spaces: manmeet3591/AFDBench

manmeet3591 committed on Apr 7

Commit 69e0075 (verified) · 1 parent: f329cee

Upload folder using huggingface_hub

Files changed (3)

README.md +16 -6
app.py +50 -0
requirements.txt +2 -0

README.md CHANGED

@@ -1,12 +1,22 @@
 ---
 title: AFDBench
-emoji: 🐨
-colorFrom: pink
-colorTo: purple
+emoji: 🌦
+colorFrom: blue
+colorTo: green
 sdk: gradio
-sdk_version: 6.11.0
+sdk_version: 4.19.2
 app_file: app.py
-pinned: false
+pinned: true
+license: apache-2.0
+short_description: The Weather Forecast Discussion Alignment Benchmark
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# AFDBench: Area Forecast Discussion Benchmark
+AFDBench evaluates how well AI models generate professional meteorological text compared to Human NWS Forecasters.
+### Core Metrics:
+1. **Met-Align**: Physical accuracy vs. Human numerical choices.
+2. **Style-Align**: Linguistic alignment with NWS AFD professional prose.
+Initial results on 7,734 human samples reveal a massive **Meteorological Hallucination Gap** in zero-shot open models.

app.py ADDED

@@ -0,0 +1,50 @@
+import gradio as gr
+import pandas as pd
+# AFDBench: The Area Forecast Discussion Benchmark
+# Initial Zero-Shot Results from Real A100 Benchmarking (Phase 2)
+data = {
+    "Model": [
+        "Human Reference (NWS)",
+        "Nous/Hermes-3-Llama-3.1-8B",
+        "Qwen/Qwen2.5-7B-Instruct",
+        "Phi-3.5-mini-instruct",
+        "Mistral-7B-Instruct-v0.3"
+    ],
+    "Met-Align (%)": [100.0, 11.38, 9.89, 7.13, 5.69],
+    "Style-Align (0-1)": [1.00, 0.68, 0.52, 0.52, 0.52],
+    "Status": ["GOLD", "Zero-Shot", "Zero-Shot", "Zero-Shot", "Zero-Shot"],
+    "Org": ["NWS", "Nous Research", "Alibaba", "Microsoft", "Mistral AI"]
+}
+df = pd.DataFrame(data).sort_values("Met-Align (%)", ascending=False)
+def load_leaderboard():
+    return df
+with gr.Blocks(title="AFDBench: Weather Forecast Discussion Benchmark") as demo:
+    gr.Markdown("# 🌦 AFDBench")
+    gr.Markdown("### The Area Forecast Discussion (AFD) Benchmark")
+    gr.Markdown(
+        "AFDBench evaluates the ability of LLMs to generate professional National Weather Service (NWS) "
+        "Forecast Discussions from numerical weather model data (WeatherNext 2). "
+        "We measure **Human Alignment** using two primary metrics:"
+    )
+    with gr.Row():
+        gr.Markdown("- **Met-Align**: Numerical faithfulness to the Human Meteorologist's choices.")
+        gr.Markdown("- **Style-Align**: Adherence to professional NWS AFD dialect and formatting.")
+    gr.DataFrame(value=df, interactive=False)
+    gr.Markdown("---")
+    gr.Markdown("### 🚀 Benchmarking Context")
+    gr.Markdown(
+        "Current results show a massive **Meteorological Hallucination Gap**. While general models can replicate "
+        "some stylistic markers (Style-Align ~0.60), they fundamentally fail to align with the numerical decisions "
+        "made by human experts (<12% Met-Align)."
+    )
+if __name__ == "__main__":
+    demo.launch()
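
The summary text in `app.py` quotes two aggregate figures (Style-Align of about 0.60 and Met-Align under 12%). A minimal sketch of how those figures could be recomputed from the leaderboard values, using only the numbers in the `data` dict above; the variable names `models`, `best_met_align`, `mean_style_align`, and `smallest_gap` are illustrative and not part of the commit:

```python
from statistics import mean

# (model, Met-Align %, Style-Align 0-1), copied from the `data` dict in app.py.
# The human reference row is left out because it is the 100% / 1.00 baseline.
models = [
    ("Nous/Hermes-3-Llama-3.1-8B", 11.38, 0.68),
    ("Qwen/Qwen2.5-7B-Instruct", 9.89, 0.52),
    ("Phi-3.5-mini-instruct", 7.13, 0.52),
    ("Mistral-7B-Instruct-v0.3", 5.69, 0.52),
]

# Highest Met-Align among the zero-shot models.
best_met_align = max([met for _, met, _ in models])

# Unweighted mean of Style-Align across the zero-shot models.
mean_style_align = mean([style for _, _, style in models])

# Hypothetical "gap" figure: percentage points between the best model
# and the 100% human reference.
smallest_gap = 100.0 - best_met_align

print(f"best Met-Align: {best_met_align:.2f}%")        # best Met-Align: 11.38%
print(f"mean Style-Align: {mean_style_align:.2f}")     # mean Style-Align: 0.56
print(f"gap to human reference: {smallest_gap:.2f} points")  # 88.62 points
```

On these numbers the "<12% Met-Align" statement holds for every listed model, while the unweighted Style-Align mean is 0.56, slightly below the "~0.60" quoted in the app text; only the top model (0.68) is above it.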

requirements.txt ADDED

@@ -0,0 +1,2 @@
+gradio==4.19.2
+pandas
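
This commit states the Gradio version in two places: `sdk_version` in the README front matter and the `gradio==` pin in `requirements.txt`. The sketch below shows one way to check that the two stay in step. The helper names `front_matter_value` and `pinned_version` are hypothetical, the inputs are excerpts pasted from the diff rather than files read from disk, and the diff itself does not say whether a mismatch would matter on Spaces.

```python
import re

# Excerpts copied from this commit (new side of the diff only).
README_FRONT_MATTER = """\
title: AFDBench
sdk: gradio
sdk_version: 4.19.2
app_file: app.py
"""
REQUIREMENTS = """\
gradio==4.19.2
pandas
"""


def front_matter_value(text, key):
    """Return the value of a simple `key: value` line, or None if absent."""
    match = re.search(rf"^{re.escape(key)}:\s*(\S+)\s*$", text, flags=re.MULTILINE)
    return match.group(1) if match else None


def pinned_version(requirements, package):
    """Return the version from a `package==version` line, or None if unpinned."""
    match = re.search(
        rf"^{re.escape(package)}==(\S+)\s*$", requirements, flags=re.MULTILINE
    )
    return match.group(1) if match else None


sdk_version = front_matter_value(README_FRONT_MATTER, "sdk_version")
gradio_pin = pinned_version(REQUIREMENTS, "gradio")
versions_agree = sdk_version == gradio_pin

print(sdk_version, gradio_pin, versions_agree)
```

The same helper reports `pandas` as unpinned, since `requirements.txt` lists it without a version.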