manmeet3591 committed
Commit 69e0075 · verified · 1 Parent(s): f329cee

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +16 -6
  2. app.py +50 -0
  3. requirements.txt +2 -0
README.md CHANGED
@@ -1,12 +1,22 @@
  ---
  title: AFDBench
- emoji: 🐨
- colorFrom: pink
- colorTo: purple
+ emoji: 🌦
+ colorFrom: blue
+ colorTo: green
  sdk: gradio
- sdk_version: 6.11.0
+ sdk_version: 4.19.2
  app_file: app.py
- pinned: false
+ pinned: true
+ license: apache-2.0
+ short_description: The Weather Forecast Discussion Alignment Benchmark
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # AFDBench: Area Forecast Discussion Benchmark
+
+ AFDBench evaluates how well AI models generate professional meteorological text compared to Human NWS Forecasters.
+
+ ### Core Metrics:
+ 1. **Met-Align**: Physical accuracy vs. Human numerical choices.
+ 2. **Style-Align**: Linguistic alignment with NWS AFD professional prose.
+
+ Initial results on 7,734 human samples reveal a massive **Meteorological Hallucination Gap** in zero-shot open models.
app.py ADDED
@@ -0,0 +1,50 @@
+ import gradio as gr
+ import pandas as pd
+
+ # AFDBench: The Area Forecast Discussion Benchmark
+ # Initial Zero-Shot Results from Real A100 Benchmarking (Phase 2)
+
+ data = {
+     "Model": [
+         "Human Reference (NWS)",
+         "Nous/Hermes-3-Llama-3.1-8B",
+         "Qwen/Qwen2.5-7B-Instruct",
+         "Phi-3.5-mini-instruct",
+         "Mistral-7B-Instruct-v0.3"
+     ],
+     "Met-Align (%)": [100.0, 11.38, 9.89, 7.13, 5.69],
+     "Style-Align (0-1)": [1.00, 0.68, 0.52, 0.52, 0.52],
+     "Status": ["GOLD", "Zero-Shot", "Zero-Shot", "Zero-Shot", "Zero-Shot"],
+     "Org": ["NWS", "Nous Research", "Alibaba", "Microsoft", "Mistral AI"]
+ }
+
+ df = pd.DataFrame(data).sort_values("Met-Align (%)", ascending=False)
+
+ def load_leaderboard():
+     return df
+
+ with gr.Blocks(title="AFDBench: Weather Forecast Discussion Benchmark") as demo:
+     gr.Markdown("# 🌦 AFDBench")
+     gr.Markdown("### The Area Forecast Discussion (AFD) Benchmark")
+     gr.Markdown(
+         "AFDBench evaluates the ability of LLMs to generate professional National Weather Service (NWS) "
+         "Forecast Discussions from numerical weather model data (WeatherNext 2). "
+         "We measure **Human Alignment** using two primary metrics:"
+     )
+
+     with gr.Row():
+         gr.Markdown("- **Met-Align**: Numerical faithfulness to the Human Meteorologist's choices.")
+         gr.Markdown("- **Style-Align**: Adherence to professional NWS AFD dialect and formatting.")
+
+     gr.DataFrame(value=df, interactive=False)
+
+     gr.Markdown("---")
+     gr.Markdown("### 🚀 Benchmarking Context")
+     gr.Markdown(
+         "Current results show a massive **Meteorological Hallucination Gap**. While general models can replicate "
+         "some stylistic markers (Style-Align ~0.60), they fundamentally fail to align with the numerical decisions "
+         "made by human experts (<12% Met-Align)."
+     )
+
+ if __name__ == "__main__":
+     demo.launch()
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ gradio==4.19.2
+ pandas
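
Review note: the "Benchmarking Context" text in app.py summarizes the table as "Style-Align ~0.60" and "<12% Met-Align". The sketch below is one way to check those summary figures against the hard-coded scores. It is not part of the commit: the variable names (`met_align`, `style_align`, `best_met_align`, `mean_style_align`, `gap_to_human`) are hypothetical, and it assumes the human reference row is fixed at 100.0 and excluded from the zero-shot averages.

```python
from statistics import fmean

# Zero-shot scores copied from the `data` dict in app.py.
# The "Human Reference (NWS)" row is left out on purpose (assumption:
# the summary text describes only the zero-shot models).
models = [
    "Nous/Hermes-3-Llama-3.1-8B",
    "Qwen/Qwen2.5-7B-Instruct",
    "Phi-3.5-mini-instruct",
    "Mistral-7B-Instruct-v0.3",
]
met_align = [11.38, 9.89, 7.13, 5.69]      # percent
style_align = [0.68, 0.52, 0.52, 0.52]     # 0-1 scale

# Hypothetical summary statistics; none of these names appear in the commit.
best_met_align = max(met_align)
mean_style_align = fmean(style_align)
gap_to_human = 100.0 - best_met_align      # human reference assumed to be 100.0

print(f"best zero-shot Met-Align:   {best_met_align:.2f}%")
print(f"mean zero-shot Style-Align: {mean_style_align:.2f}")
print(f"Met-Align gap to human:     {gap_to_human:.2f} points")
```

Under these assumptions the best zero-shot Met-Align is 11.38%, which supports the "<12%" claim. The mean Style-Align is 0.56 with a range of 0.52 to 0.68, so "~0.60" is a loose rounding; quoting the range in the UI text might be more precise.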