FuryAssassin committed on
Commit 9baec9c · verified · 1 Parent(s): 25351e1

Upload bug_report.md with huggingface_hub

Files changed (1): bug_report.md (+57 -0)
bug_report.md ADDED
Bug Report: Evaluation pipeline issues

Summary
- The evaluation scripts in evaluation/ use a compiled Cython module (evaluation/utils/benchmark_utils.*) to compute benchmark scores.
- eval.py changes the working directory and constructs benchmark script paths relative to the repository root in a fragile way, so subprocess calls can end up running the wrong scripts.
- The compiled benchmark_utils provides the BENCHMARK_CALCULATORS map and several calculate_xxx_score functions; these are valid, but the pure-Python fallback utils/benchmark_utils.py is missing (only compiled .c/.so files are present). Depending on sys.path, imports can therefore fail whenever Python tries to load the plain Python module.

Findings
1) eval.py issues
- eval.py inserts os.path.dirname(__file__) into sys.path expecting to import evaluation.utils, but the Cython module lives in evaluation/utils as a compiled extension. The call sys.path.insert(0, os.path.dirname(__file__)) puts evaluation/ on sys.path, and the script then imports from utils.benchmark_utils; this only works if that package layout is importable. The compiled modules exist as evaluation/utils/*.so, but package resolution can fail because evaluation/ has no __init__.py and is therefore not a package. As a result, import utils.benchmark_utils raises ModuleNotFoundError in some environments (a more robust setup is sketched after this list).
- run_benchmark_evaluation builds the benchmark script path as os.path.join("evaluation", "benchmarks", benchmark_name, "eval.py") and runs it through subprocess with sys.executable. Earlier, main() sets script_dir = os.path.dirname(os.path.abspath(__file__)) and calls os.chdir(os.path.dirname(script_dir)), which moves the cwd to the parent of evaluation/ (the repo root), so the evaluation/benchmarks/... path does resolve. However, the benchmark scripts themselves import utils via sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', '..')), relying on each script's __file__ and on evaluation/ acting as a package root; these relative inserts sometimes end up trying to import the missing pure-Python modules instead of the compiled .so. Overall, the setup is brittle.
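
As a sketch of how the import could be made robust (related to recommended fixes 1 and 2 below): this is a minimal illustration, assuming the layout evaluation/eval.py and evaluation/utils/benchmark_utils described above; the REPO_ROOT variable and the fallback branch are illustrative, not the current code.

```python
# Hypothetical import preamble for evaluation/eval.py: anchor sys.path on the
# repository root (the parent of evaluation/) so the import works regardless of
# the directory the script is launched from.
import os
import sys

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)

try:
    # Preferred path: evaluation/ and evaluation/utils/ are proper packages
    # (both have __init__.py), so the compiled extension resolves by package name.
    from evaluation.utils.benchmark_utils import get_benchmark_score
except ImportError:
    # Fallback path: a pure-Python benchmark_utils.py placed next to the
    # compiled extension (recommended fix 1), imported via evaluation/ on sys.path.
    sys.path.insert(0, os.path.join(REPO_ROOT, "evaluation"))
    from utils.benchmark_utils import get_benchmark_score
```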

2) benchmark_utils availability and content
- The source for benchmark_utils exists only as compiled C files and extension modules under evaluation/utils (benchmark_utils.c, benchmark_utils.cpython-313-x86_64-linux-gnu.so, __init__.c and the corresponding .so). There is no utils/benchmark_utils.py source file in the workspace. The compiled module implements a BENCHMARK_CALCULATORS mapping and a get_benchmark_score function that dispatches to the individual calculate_xxx_score functions.
- Because the .so files are platform- and Python-version-specific, running the evaluation in a different environment may fail to import them. The eval/benchmark scripts attempt to import utils.benchmark_utils; in our environment the import failed because evaluation is not a package (a quick diagnostic is sketched below).
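
A quick way to check whether the compiled extension is importable before running the full evaluation is sketched below; this is a diagnostic illustration that assumes it is run from the repository root, and it does not modify any project code.

```python
# Diagnostic: report whether utils.benchmark_utils resolves to the compiled .so
# (or to a pure-Python fallback) in the current environment.
import importlib.util
import os
import sys

sys.path.insert(0, os.path.abspath("evaluation"))

try:
    spec = importlib.util.find_spec("utils.benchmark_utils")
except ModuleNotFoundError:
    spec = None  # e.g. evaluation/utils is not importable as a package

if spec is None:
    print("utils.benchmark_utils is NOT importable "
          "(missing __init__, missing .py fallback, or wrong platform/interpreter for the .so)")
else:
    print("utils.benchmark_utils resolves to:", spec.origin)
```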

3) Suspicious logic in the get_benchmark_score path
- The Cython-generated code shows that get_benchmark_score does calculator = BENCHMARK_CALCULATORS.get(benchmark_name), returns None if no calculator is registered, and otherwise returns calculator(step_value). This is straightforward.
- calculate_math_reasoning_score uses a sigmoid-like formula: x = step_value / 100.0; score = 0.3 + 0.5 * (1 - 1/(1 + 0.1*x)); return round(min(score, 0.95), 3).
- Many calculate functions follow similar formula shapes with caps (0.9-0.95). These implementations appear reasonable.
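
For reference, the dispatch logic and the one formula quoted above can be written out in plain Python as follows. Only calculate_math_reasoning_score is taken verbatim from this report; the "math_reasoning" key and the shape of the BENCHMARK_CALCULATORS table are reconstructions, not a dump of the compiled module.

```python
# Pure-Python rendering of the logic described above (the original exists only
# as a compiled extension). Other calculate_xxx_score functions are analogous
# sigmoid-like curves with caps around 0.9-0.95.
def calculate_math_reasoning_score(step_value):
    x = step_value / 100.0
    score = 0.3 + 0.5 * (1 - 1 / (1 + 0.1 * x))
    return round(min(score, 0.95), 3)

BENCHMARK_CALCULATORS = {
    "math_reasoning": calculate_math_reasoning_score,  # key name assumed from the function name
    # ... one calculate_xxx_score entry per benchmark in the compiled module
}

def get_benchmark_score(benchmark_name, step_value):
    calculator = BENCHMARK_CALCULATORS.get(benchmark_name)
    if calculator is None:
        return None
    return calculator(step_value)
```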

4) Checkpoint configs
- Each checkpoint folder contains a config.json, but none of the ten configs contains an eval_accuracy field. The configs are minimal and essentially identical; step_100/config.json in particular carries no eval metadata at all. This is suspicious given that the README expects per-checkpoint metrics.
- Specifically, every config holds only model_type and architectures. The step_100 config differs slightly in whitespace but is identical in content. No checkpoint contains an "eval_accuracy" field (a quick audit is sketched below).
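
The audit below shows how this can be checked; it is an illustrative snippet that assumes the checkpoint folders (step_100 ... step_1000) sit directly under the repository root, which may need adjusting for the actual layout.

```python
# List the keys of every checkpoint config and flag the missing eval_accuracy field.
import json
from pathlib import Path

for config_path in sorted(Path(".").glob("step_*/config.json")):
    config = json.loads(config_path.read_text())
    status = "present" if "eval_accuracy" in config else "MISSING"
    print(f"{config_path}: keys={sorted(config)}, eval_accuracy {status}")
```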

Actions taken
- I analyzed the compiled benchmark_utils.c to extract the scoring functions. Because the Python source for benchmark_utils.py is absent, I implemented equivalent Python scoring functions locally to compute scores reliably from step numbers for all 15 benchmarks.
- I recalculated all 15 benchmark scores for each checkpoint step (100..1000) and computed the overall weighted score according to the weights in evaluation/eval.py (the aggregation is sketched below).
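
The aggregation has the following shape; the weights dictionary here is a placeholder, since the real per-benchmark weights live in evaluation/eval.py and are not reproduced in this report, and whether the weights are normalized is also an assumption to verify against eval.py.

```python
# Sketch of the overall-score aggregation used for the recalculated results.
# BENCHMARK_WEIGHTS is a placeholder; substitute the weights from
# evaluation/eval.py to reproduce the numbers below.
BENCHMARK_WEIGHTS = {
    "math_reasoning": 1.0,  # placeholder value, not the real weight
    # ... entries for the remaining benchmarks
}

def overall_score(step_value):
    total, weight_sum = 0.0, 0.0
    for name, weight in BENCHMARK_WEIGHTS.items():
        score = get_benchmark_score(name, step_value)  # per-benchmark calculator from above
        if score is None:
            continue
        total += weight * score
        weight_sum += weight
    # Assumes a normalized weighted average; use a plain weighted sum if that is
    # what evaluation/eval.py actually does.
    return round(total / weight_sum, 3) if weight_sum else 0.0
```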

Recalculated results (overall only)
- step_100: 0.377
- step_200: 0.392
- step_300: 0.405
- step_400: 0.418
- step_500: 0.429
- step_600: 0.440
- step_700: 0.450
- step_800: 0.459
- step_900: 0.468
- step_1000: 0.476

Conclusion
- The primary bug is the missing Python source for utils.benchmark_utils combined with fragile import assumptions in eval.py (relying on evaluation/ being a package and on the compiled modules being present). This causes benchmark runs to fail or behave inconsistently across environments.
- The checkpoint config files lack eval_accuracy and therefore cannot be trusted; metrics must be recomputed using the scoring functions. Among the valid checkpoints (those with proper step numbers), step_1000 has the highest recalculated overall eval score (0.476).

Recommended fixes
1) Add a pure-Python fallback utils/benchmark_utils.py that exposes the same API as the compiled module. This ensures cross-platform reproducibility and makes debugging easier.
2) Make evaluation/ a proper Python package by adding evaluation/__init__.py and evaluation/utils/__init__.py, or change the imports to relative ones. Ensure eval.py uses a robust import mechanism (e.g., appending the package root to sys.path based on the file's location) and does not rely on the compiled modules being present.
3) Write eval_accuracy into the checkpoint config files (or into a separate metadata file) when checkpoints are generated, so the README and publishing scripts can read it deterministically.
4) Document the scoring formulas in human-readable Python source and add tests comparing the compiled module's output with the Python fallback (a possible shape for such a test is sketched below).
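
A minimal parity test could look like the following; the import paths, the fallback module name benchmark_utils_fallback, and the benchmark list are all placeholders that would need to match the real package layout.

```python
# Parity test between the compiled extension and a pure-Python fallback
# (recommended fix 4). Adjust imports and the benchmark list to the real layout.
import pytest

from evaluation.utils import benchmark_utils as compiled_utils   # compiled .so
import benchmark_utils_fallback as py_utils                      # hypothetical pure-Python copy

BENCHMARKS = ["math_reasoning"]       # placeholder; list all 15 benchmark names
STEPS = range(100, 1001, 100)

@pytest.mark.parametrize("step", STEPS)
@pytest.mark.parametrize("benchmark", BENCHMARKS)
def test_fallback_matches_compiled(benchmark, step):
    expected = compiled_utils.get_benchmark_score(benchmark, step)
    actual = py_utils.get_benchmark_score(benchmark, step)
    assert actual == pytest.approx(expected, abs=1e-9)
```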

Files added
- calc_blocks.txt (extracted commented snippets)
- recalculated_scores.json (per-checkpoint recalculated per-benchmark scores and overall)
- bug_report.md (this file)