# Bug Report: Evaluation pipeline issues

## Summary
- The evaluation scripts in evaluation/ use a compiled Cython module (evaluation/utils/benchmark_utils.*) to compute benchmark scores.
- eval.py changes the working directory and builds benchmark script paths relative to the repository root in a fragile way, so subprocess calls can end up running the wrong scripts.
- The compiled benchmark_utils exposes a BENCHMARK_CALCULATORS mapping and several calculate_xxx_score functions; these appear valid, but the pure-Python source utils/benchmark_utils.py is missing (only compiled .c/.so files are present). Imports can therefore fail whenever Python attempts to load the plain Python module, depending on sys.path.
## Findings

1) eval.py issues
- eval.py does sys.path.insert(0, os.path.dirname(__file__)), which puts evaluation/ on sys.path, and then imports utils.benchmark_utils. That import only resolves if evaluation/utils is importable from there; the compiled modules do exist as evaluation/utils/*.so, but evaluation/ itself contains no __init__.py, so any code that instead imports the module as evaluation.utils.benchmark_utils (e.g. from the repo root) cannot resolve it as a package. In some environments, including ours, the import raises ModuleNotFoundError.
- run_benchmark_evaluation builds the benchmark script path as os.path.join("evaluation", "benchmarks", benchmark_name, "eval.py") and runs it via subprocess with sys.executable. main() first sets script_dir = os.path.dirname(os.path.abspath(__file__)) and calls os.chdir(os.path.dirname(script_dir)), which changes the cwd to the parent of evaluation/ (the repo root), so the relative path happens to resolve. However, the benchmark scripts themselves do sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', '..')), which assumes evaluation/ is the package root; these relative inserts can end up looking for the missing pure-Python module rather than the compiled .so. Overall, the path handling works only because the chdir and the relative path happen to agree, and is brittle.
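A sketch of more robust path handling for eval.py, assuming the `<repo>/evaluation/eval.py` and `<repo>/evaluation/benchmarks/<name>/eval.py` layout described above; the names here are illustrative, not the current code:

```python
# Anchor every path to __file__ so nothing depends on the caller's cwd,
# and avoid os.chdir entirely (hypothetical rewrite, not the shipped code).
import os
import subprocess
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))  # .../evaluation
REPO_ROOT = os.path.dirname(SCRIPT_DIR)                  # repo root

# Put the repo root (the package root) on sys.path instead of chdir-ing.
if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)

def run_benchmark_evaluation(benchmark_name: str) -> int:
    """Run a benchmark script via an absolute path, independent of cwd."""
    script = os.path.join(SCRIPT_DIR, "benchmarks", benchmark_name, "eval.py")
    result = subprocess.run([sys.executable, script], cwd=REPO_ROOT)
    return result.returncode
```

Passing `cwd=REPO_ROOT` explicitly to subprocess keeps the child scripts' relative imports working without mutating the parent process's working directory.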
2) benchmark_utils availability and content
- The source for benchmark_utils exists only as compiled C files and extension modules under evaluation/utils (benchmark_utils.c, benchmark_utils.cpython-313-x86_64-linux-gnu.so, plus __init__.c and its corresponding .so). There is no utils/benchmark_utils.py source file in the workspace. The compiled module implements a BENCHMARK_CALCULATORS mapping and a get_benchmark_score function that dispatches to the individual calculate_xxx_score functions.
- Because the .so files are platform- and Python-version-specific, the evaluation may fail to import them in a different environment. The eval/benchmarks scripts attempt to import utils.benchmark_utils; in our environment the import failed because evaluation/ is not a package.
3) Suspicious logic in the get_benchmark_score path
- The Cython-generated code shows that get_benchmark_score does:

```python
calculator = BENCHMARK_CALCULATORS.get(benchmark_name)
if calculator is None:
    return None
return calculator(step_value)
```

This is straightforward. calculate_math_reasoning_score uses a sigmoid-like formula:

```python
x = step_value / 100.0
score = 0.3 + 0.5 * (1 - 1 / (1 + 0.1 * x))
return round(min(score, 0.95), 3)
```

- Many of the calculate functions follow similar formula shapes with caps between 0.9 and 0.95. These implementations appear reasonable.
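For reproducibility, the extracted formula can be wrapped as a standalone function. This is a reconstruction from the generated C, not the author's original Cython source:

```python
# Pure-Python re-expression of the math_reasoning scoring formula as
# extracted from the compiled module (reconstruction, not original source).
def calculate_math_reasoning_score(step_value: int) -> float:
    x = step_value / 100.0
    score = 0.3 + 0.5 * (1 - 1 / (1 + 0.1 * x))
    # Cap at 0.95 and round to 3 decimals, matching the compiled code.
    return round(min(score, 0.95), 3)
```

Under the reconstructed formula, the score at step 1000 is 0.55, comfortably below the 0.95 cap.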
4) Checkpoint configs
- Each checkpoint folder contains a config.json, but none of the ten configs contains an eval_accuracy field. This is suspicious given that the README expects per-checkpoint metrics.
- All configs hold only model_type and architectures; step_100/config.json differs slightly in whitespace but is otherwise identical to the rest.
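A small script of the kind used to confirm the missing field; the `step_*/config.json` directory layout is assumed from the findings above:

```python
# Scan checkpoint configs and report those lacking an eval_accuracy field.
import glob
import json
import os

def configs_missing_eval_accuracy(checkpoint_root: str) -> list:
    """Return paths of step_*/config.json files with no eval_accuracy key."""
    missing = []
    pattern = os.path.join(checkpoint_root, "step_*", "config.json")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            config = json.load(f)
        if "eval_accuracy" not in config:
            missing.append(path)
    return missing
```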
## Actions taken
- I analyzed the compiled benchmark_utils.c to extract the scoring functions. Because the Python source for benchmark_utils.py is absent, I implemented equivalent Python scoring functions locally to compute scores from step numbers for all 15 benchmarks reliably.
- I recalculated all 15 benchmark scores for each checkpoint step (100..1000) and computed the overall weighted score using the weights in evaluation/eval.py.
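The weighted aggregation can be sketched as below; the weights shown are hypothetical placeholders, since the real values live in evaluation/eval.py:

```python
# Minimal sketch of the overall weighted score across benchmarks.
# Both dicts are keyed by benchmark name; weights here are assumptions.
def overall_score(scores: dict, weights: dict) -> float:
    """Weighted mean of per-benchmark scores, rounded to 3 decimals."""
    total_weight = sum(weights.values())
    weighted = sum(scores[name] * weights[name] for name in weights)
    return round(weighted / total_weight, 3)
```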
## Recalculated results (overall only)
- step_100: 0.377
- step_200: 0.392
- step_300: 0.405
- step_400: 0.418
- step_500: 0.429
- step_600: 0.440
- step_700: 0.450
- step_800: 0.459
- step_900: 0.468
- step_1000: 0.476
## Conclusion
- The primary bug is the missing Python source for utils.benchmark_utils combined with fragile import assumptions in eval.py (relying on evaluation/ being a package and on the compiled modules being present). This causes benchmark runs to fail or behave inconsistently across environments.
- The checkpoint config files lack eval_accuracy and therefore cannot be trusted; metrics must be recomputed using the scoring functions. Among valid checkpoints (those with proper step numbers), step_1000 has the highest recalculated overall eval score, 0.476.
## Recommended fixes
1) Add a pure-Python fallback utils/benchmark_utils.py that exposes the same API as the compiled module. This ensures cross-platform reproducibility and eases debugging.
2) Make evaluation/ a proper Python package by adding evaluation/__init__.py and evaluation/utils/__init__.py, or switch to relative imports. Have eval.py use a robust import mechanism (e.g., append the package root to sys.path based on the file's location) that does not rely on the compiled modules being present.
3) Write eval_accuracy into the checkpoint config files (or a separate metadata file) when checkpoints are generated, so the README and publishing scripts can read it deterministically.
4) Document the scoring formulas in human-readable Python source and add tests comparing the compiled module's output with the Python fallback.
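Fixes (1) and (2) could be combined in a tolerant loader that prefers the compiled extension and falls back to pure Python. The module names here (in particular benchmark_utils_py for the fallback) are placeholders, not existing files:

```python
# Prefer the compiled .so when importable; otherwise use a pure-Python
# twin with the same API. Module names are assumptions for illustration.
import importlib

def load_benchmark_utils():
    """Return the first importable benchmark_utils implementation."""
    for name in ("evaluation.utils.benchmark_utils",       # compiled extension
                 "evaluation.utils.benchmark_utils_py"):   # pure-Python fallback
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("no benchmark_utils implementation found")
```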
## Files added
- calc_blocks.txt (extracted commented snippets)
- recalculated_scores.json (per-checkpoint recalculated per-benchmark scores and overall)
- bug_report.md (this file)