Spaces: nuojohnchen/Kahneman4Review

nuocuhz Claude Sonnet 4.6 committed on Mar 13

Commit 918804b

1 Parent(s): 0ab9a26

Analytics: remove Non-evaluative, trim findings to 3 points, add slogan

Files changed (1)

analytics.py +11 -14

@@ -18,10 +18,9 @@ DATASETS = {
 }
 LABEL_COLORS = {
-    "System 1":       "#ef4444",
-    "Mixed":          "#f59e0b",
-    "System 2":       "#22c55e",
-    "Non-evaluative": "#94a3b8",
 }
 CONF_COLORS = {
@@ -69,7 +68,7 @@ def load_all() -> dict:
 def fig_label_distribution(data: dict) -> go.Figure:
     """Grouped bar: label distribution per conference."""
-    labels_order = ["System 1", "Mixed", "System 2", "Non-evaluative"]
     confs = list(data.keys())
     fig = go.Figure()
@@ -271,19 +270,17 @@ def build_summary(data: dict) -> str:
 FINDINGS = """
-### Key Findings (100 papers × 3 conferences, ~1,150 reviews)
-1. **Mixed reasoning dominates across all venues (49–57%).** Pure System 1 or System 2 reviews are the minority — most reviewers blend intuitive and analytical modes rather than operating at either extreme.
-2. **ICLR reviewers show more System 1 tendency (35%) than ICML/NeurIPS (~21%).** This may reflect ICLR's open-ended review format, which imposes less structural scaffolding than ICML's field-by-field template — less structure → more impression-driven writing.
-3. **ICML and NeurIPS reviewers show more System 2 tendency (~23–26%) than ICLR (16%).** ICML's structured fields (*Claims and Evidence*, *Theoretical Claims*, *Experimental Designs*) appear to scaffold more explicit, decomposed reasoning.
-4. **Reasoning quality (RQS) is nearly identical across venues (2.80–2.94 / 5).** Despite different formats and communities, the overall analytical depth of peer review is remarkably uniform — suggesting a field-wide ceiling rather than venue-specific culture.
-5. **Decision tier does not predict review quality.** Oral-paper reviews are not systematically stronger than Poster reviews (differences < 0.2 RQS points). Reviewers do not write more analytically for papers they rate highly.
-6. **Checklist Inflation is the dominant bias in all three venues** (50–58% of reviews). Reviewers frequently enumerate specific concerns without analytical linkage, prioritization, or core-claim relevance — mistaking list length for reasoning depth.
-7. **Representativeness Heuristic is more prevalent at NeurIPS (27%) than ICLR/ICML (~17–21%).** NeurIPS reviewers more often judge papers by surface similarity to known strong work rather than explicit criteria.
 """

 }
 LABEL_COLORS = {
+    "System 1": "#ef4444",
+    "Mixed":    "#f59e0b",
+    "System 2": "#22c55e",
 }
 CONF_COLORS = {
 def fig_label_distribution(data: dict) -> go.Figure:
     """Grouped bar: label distribution per conference."""
+    labels_order = ["System 1", "Mixed", "System 2"]
     confs = list(data.keys())
     fig = go.Figure()
 FINDINGS = """
+### Key Findings
+*100 papers × 3 conferences, ~1,150 reviews, rated by claude-sonnet-4-6. Papers sampled by stratified random sampling proportional to acceptance tier (Oral / Spotlight / Poster) within each venue.*
+1. **ICML and NeurIPS reviewers show more System 2 tendency (~23–26%) than ICLR (16%).** ICML's structured fields (*Claims and Evidence*, *Theoretical Claims*, *Experimental Designs*) appear to scaffold more explicit, decomposed reasoning.
+2. **Despite different formats and communities, the overall analytical depth of peer review is remarkably uniform** (RQS 2.80–2.94 / 5), suggesting a field-wide ceiling rather than venue-specific culture.
+3. **Decision tier does not predict review quality.** Oral-paper reviews are not systematically stronger than Poster reviews (differences < 0.2 RQS points). Reviewers do not write more analytically for papers they rate highly.
+---
+> *We are not against AI review. We are against flawed reasoning behind review.*
 """
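
The sampling note added to `FINDINGS` (stratified random sampling proportional to acceptance tier) could be implemented roughly as follows. This is a minimal sketch, not code from this commit: the helper name `stratified_sample`, the `tier` field, the largest-remainder rounding, and the fixed seed are all assumptions made for illustration, and the pool sizes in the usage lines are invented.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(papers, n, key="tier", seed=0):
    """Draw n papers with per-tier counts proportional to each tier's share.

    Hypothetical helper: `papers` is a list of dicts carrying a `key` field.
    """
    if n > len(papers):
        raise ValueError("n exceeds the number of available papers")

    # Group papers by stratum (here: acceptance tier).
    strata = defaultdict(list)
    for paper in papers:
        strata[paper[key]].append(paper)

    # Fractional allocation per tier, then largest-remainder rounding
    # so the integer counts sum to exactly n.
    total = len(papers)
    ideal = {t: n * len(group) / total for t, group in strata.items()}
    counts = {t: int(q) for t, q in ideal.items()}
    leftover = n - sum(counts.values())
    by_remainder = sorted(ideal, key=lambda t: ideal[t] - counts[t], reverse=True)
    for t in by_remainder[:leftover]:
        counts[t] += 1

    # Sample without replacement inside each tier.
    rng = random.Random(seed)
    sample = []
    for t, group in strata.items():
        sample.extend(rng.sample(group, counts[t]))
    return sample


# Illustrative pool with made-up tier sizes (not the venues' real numbers).
pool = (
    [{"id": f"o{i}", "tier": "Oral"} for i in range(20)]
    + [{"id": f"s{i}", "tier": "Spotlight"} for i in range(80)]
    + [{"id": f"p{i}", "tier": "Poster"} for i in range(900)]
)
picked = stratified_sample(pool, 100)
tier_counts = Counter(p["tier"] for p in picked)
```

Fixing the seed keeps the draw reproducible across runs, which matters if the sampled papers are later re-rated or audited.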