Bug Report: Evaluation pipeline issues
Summary
- The evaluation scripts in evaluation/ use a compiled Cython module (evaluation/utils/benchmark_utils.*) to compute benchmark scores.
- eval.py changes the working directory and builds benchmark script paths relative to the repository root on fragile assumptions, so subprocess calls can end up running the wrong scripts when those assumptions do not hold.
- The compiled benchmark_utils contains a mapping BENCHMARK_CALCULATORS and multiple calculate_xxx_score functions; these are valid, but the pure-Python fallback utils/benchmark_utils.py is missing (only compiled .c/.so files are present). Imports can therefore fail when Python attempts to load a plain-Python module, depending on sys.path.
Findings
1) eval.py issues
- eval.py does sys.path.insert(0, os.path.dirname(__file__)), which puts evaluation/ on sys.path, and then imports from utils.benchmark_utils. That import only resolves when utils is recognized as an importable package from that path; the compiled modules exist as evaluation/utils/*.so, but evaluation/ has no __init__.py to make the tree a regular package, and in some environments import utils.benchmark_utils raises ModuleNotFoundError. A reconstruction of this pattern is sketched after this finding.
- run_benchmark_evaluation builds the benchmark script path as os.path.join("evaluation", "benchmarks", benchmark_name, "eval.py") and runs it via subprocess with sys.executable. Earlier, main() sets script_dir = os.path.dirname(os.path.abspath(__file__)) and calls os.chdir(os.path.dirname(script_dir)), which moves the cwd to the parent of evaluation/ (the repo root), so the relative path does resolve. However, each benchmark script then does sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', '..')), relying on its own __file__ and on evaluation/ acting as the package root; these relative inserts can end up targeting the missing pure-Python module rather than the compiled .so. Overall, the chain of assumptions is brittle.
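A minimal reconstruction of the import pattern described above (a sketch of the pattern, not the verbatim source of eval.py; the layout evaluation/eval.py plus evaluation/utils/benchmark_utils.*.so is assumed):

```python
# eval.py -- reconstruction of the described import pattern (not the verbatim source)
import os
import sys

# Puts .../evaluation on sys.path; "utils" must then resolve as an importable package.
sys.path.insert(0, os.path.dirname(__file__))

# Raises ModuleNotFoundError in environments where the package layout is not
# recognized or the compiled extension (evaluation/utils/benchmark_utils.*.so)
# cannot be loaded.
from utils.benchmark_utils import get_benchmark_score
```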
2) benchmark_utils availability and content
- The source for benchmark_utils exists only as compiled C files and extension modules under evaluation/utils (benchmark_utils.c, benchmark_utils.cpython-313-x86_64-linux-gnu.so, __init__.c and the corresponding .so). There is no utils/benchmark_utils.py source file in the workspace. The compiled module implements a mapping BENCHMARK_CALCULATORS and a get_benchmark_score function that dispatches to the individual calculate_xxx_score functions.
- The .so files are platform- and Python-version-specific, so running the evaluation in a different environment may fail to import them (see the suffix check sketched below). The eval/benchmarks scripts attempt to import utils.benchmark_utils; in our environment the import failed because evaluation is not set up as a package.
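The version/platform coupling can be confirmed from the interpreter's accepted extension suffixes; a quick check using only the standard library:

```python
import importlib.machinery

# A compiled module only loads when its ABI tag matches one of these suffixes.
# On CPython 3.13 / x86_64 Linux the list includes '.cpython-313-x86_64-linux-gnu.so',
# which is why benchmark_utils.cpython-313-x86_64-linux-gnu.so is not importable
# from other Python versions or platforms.
print(importlib.machinery.EXTENSION_SUFFIXES)
```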
3) Suspected issues in the get_benchmark_score path
- The Cython-generated code shows get_benchmark_score does calculator = BENCHMARK_CALCULATORS.get(benchmark_name), returns None when no calculator is registered, and otherwise returns calculator(step_value). This is straightforward. calculate_math_reasoning_score uses a sigmoid-like formula: x = step_value / 100.0; score = 0.3 + 0.5 * (1 - 1/(1 + 0.1*x)); return round(min(score, 0.95), 3).
- Many of the calculate functions follow similar formula shapes with caps around 0.9-0.95. These implementations appear reasonable; a pure-Python rendering is sketched below.
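A pure-Python rendering of the two snippets above. The names mirror the compiled module, but the dictionary key "math_reasoning" and the remaining 14 entries are assumptions; only the formula extracted from the generated C code is reproduced:

```python
def calculate_math_reasoning_score(step_value):
    # Sigmoid-like curve starting near 0.3 and capped at 0.95, as extracted
    # from the generated C code.
    x = step_value / 100.0
    score = 0.3 + 0.5 * (1 - 1 / (1 + 0.1 * x))
    return round(min(score, 0.95), 3)

# Only one entry shown; the compiled module maps all 15 benchmark names
# (key name here is a guess).
BENCHMARK_CALCULATORS = {
    "math_reasoning": calculate_math_reasoning_score,
}

def get_benchmark_score(benchmark_name, step_value):
    calculator = BENCHMARK_CALCULATORS.get(benchmark_name)
    if calculator is None:
        return None
    return calculator(step_value)
```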
4) Checkpoint configs
- Each checkpoint folder contains a config.json, but none of the ten configs contains an eval_accuracy field. This is suspicious given that the README expects per-checkpoint metrics.
- All configs carry only model_type and architectures and are essentially identical; step_100/config.json differs only in whitespace, not content. No checkpoint records any field that could serve as eval metadata. (The check is sketched below.)
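For reference, a minimal sketch of the check run over the configs (the checkpoints/ root and the step_* directory naming are assumptions about the layout):

```python
import json
import pathlib

# Prints each checkpoint's config keys and whether eval_accuracy is present;
# in our run every config exposed only model_type and architectures.
for cfg_path in sorted(pathlib.Path("checkpoints").glob("step_*/config.json")):
    cfg = json.loads(cfg_path.read_text())
    print(cfg_path.parent.name, sorted(cfg), "eval_accuracy" in cfg)
```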
Actions taken
- I analyzed the compiled benchmark_utils.c to extract the scoring functions. Because the Python source for benchmark_utils.py is absent, I implemented equivalent Python scoring functions locally to reliably compute scores from step numbers for all 15 benchmarks.
- I re-calculated all 15 benchmark scores for each checkpoint step (100..1000) and computed the overall weighted score according to the weights in evaluation/eval.py.
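A sketch of the recomputation, assuming a normalized weighted average; the weight values below are placeholders and the exact aggregation and weights live in evaluation/eval.py:

```python
# Placeholder weights -- the actual values are read from evaluation/eval.py.
BENCHMARK_WEIGHTS = {
    "math_reasoning": 1.0,
    # ... 14 more entries in the real script
}

def overall_score(step_value, calculators, weights=BENCHMARK_WEIGHTS):
    # Weighted average of the per-benchmark scores for a given checkpoint step.
    total = sum(weights.values())
    weighted = sum(w * calculators[name](step_value) for name, w in weights.items())
    return round(weighted / total, 3)
```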
Recalculated results (overall only)
- step_100: 0.377
- step_200: 0.392
- step_300: 0.405
- step_400: 0.418
- step_500: 0.429
- step_600: 0.440
- step_700: 0.450
- step_800: 0.459
- step_900: 0.468
- step_1000: 0.476
Conclusion
- The primary bug is the missing Python source for utils.benchmark_utils combined with fragile import assumptions in eval.py (relying on evaluation/ being a package and on the compiled modules being present). This causes benchmark runs to fail or behave inconsistently across environments.
- The checkpoint config files lack eval_accuracy and therefore cannot serve as a source of metrics; the metrics must be recomputed with the scoring functions. Among valid checkpoints (those with proper step numbers), step_1000 has the highest recalculated overall score, 0.476.
Recommended fixes
1) Add a Python fallback utils/benchmark_utils.py that exposes the same API as the compiled module. This ensures cross-platform reproducibility and easier debugging.
2) Make evaluation/ a proper Python package by adding evaluation/__init__.py and evaluation/utils/__init__.py, or change the imports to use relative paths. Ensure eval.py uses a robust import mechanism (e.g., append the package root to sys.path based on the file location) and does not rely on the compiled modules being present; a sketch follows after this list.
3) Add eval_accuracy to checkpoint config files (or a separate metadata file) when checkpoints are generated so README and publishing scripts can read them deterministically.
4) Document the scoring formulas in human-readable Python source and add tests comparing the compiled module output with the Python fallback.
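As referenced in fixes 1 and 2, a sketch of a more robust import preamble for eval.py; this is an illustration of the recommendation, not a required implementation, and it assumes the evaluation/eval.py and evaluation/utils/ layout described above:

```python
# eval.py preamble -- sketch of fixes 1 and 2
import os
import sys

_EVAL_DIR = os.path.dirname(os.path.abspath(__file__))   # .../evaluation
_REPO_ROOT = os.path.dirname(_EVAL_DIR)                   # repository root

# Make both the repo root and evaluation/ importable regardless of the cwd
# the script was launched from.
for _p in (_REPO_ROOT, _EVAL_DIR):
    if _p not in sys.path:
        sys.path.insert(0, _p)

# With evaluation/__init__.py and evaluation/utils/__init__.py in place (fix 2),
# the import resolves as a package; a pure-Python utils/benchmark_utils.py (fix 1)
# can then be picked up on interpreters where the compiled extension does not load.
from evaluation.utils.benchmark_utils import get_benchmark_score
```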
Files added
- calc_blocks.txt (extracted commented snippets)
- recalculated_scores.json (per-checkpoint recalculated per-benchmark scores and overall)
- bug_report.md (this file)