
Bug Report: Evaluation pipeline issues

Summary

  • The evaluation scripts in evaluation/ use a compiled Cython module (evaluation/utils/benchmark_utils.*) to compute benchmark scores.
  • eval.py changes the working directory and builds benchmark script paths relative to the repository root in a fragile way, so the subprocess calls can end up running the wrong scripts depending on where eval.py is invoked from.
  • The compiled benchmark_utils exposes a mapping BENCHMARK_CALCULATORS and several calculate_xxx_score functions; these are valid, but the pure-Python fallback utils/benchmark_utils.py is missing (only the compiled .c/.so files are present). Depending on sys.path, imports can fail when Python attempts to load the plain-Python module.
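The import failure can be reproduced without the repo at all; a minimal sketch, assuming no `utils` package is importable from the current directory (mirroring a clean environment):

```python
import subprocess
import sys

# Attempt the same import eval.py performs, in a fresh interpreter with no
# extra sys.path setup. In an environment where evaluation/ has not been
# added to sys.path (and lacks __init__.py files), this fails.
result = subprocess.run(
    [sys.executable, "-c", "import utils.benchmark_utils"],
    capture_output=True,
    text=True,
)
# result.returncode is non-zero and result.stderr ends with a
# ModuleNotFoundError traceback.
```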

Findings

  1. eval.py issues
  • eval.py does sys.path.insert(0, os.path.dirname(__file__)), which puts evaluation/ on sys.path, and then imports from utils.benchmark_utils. That import only succeeds if utils resolves as a package under evaluation/, but the compiled modules exist only as evaluation/utils/*.so and there is no __init__.py to make the directory a package (evaluation/ itself also lacks one). As a result, import utils.benchmark_utils raises ModuleNotFoundError in some environments.
  • run_benchmark_evaluation builds the benchmark script path as os.path.join("evaluation", "benchmarks", benchmark_name, "eval.py") and runs it via subprocess with sys.executable. Earlier, main() sets script_dir = os.path.dirname(os.path.abspath(__file__)) and calls os.chdir(os.path.dirname(script_dir)), which changes the cwd to the parent of evaluation/ (the repo root), so the relative path does resolve. However, the benchmark scripts themselves import utils by doing sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', '..')) from their own location, expecting evaluation/ to be the package root; these relative inserts sometimes try to import the missing pure-Python modules instead of the compiled .so. Overall, the import and path logic is brittle.
  2. benchmark_utils availability and content
  • The source for benchmark_utils exists only as compiled C files and extension modules under evaluation/utils (benchmark_utils.c, benchmark_utils.cpython-313-x86_64-linux-gnu.so, __init__.c and the corresponding .so). There is no utils/benchmark_utils.py source file in the workspace. The compiled module implements a mapping BENCHMARK_CALCULATORS and a get_benchmark_score function that dispatches to the individual calculate_xxx_score functions.
  • Because the .so files are platform- and Python-version-specific, running the evaluation in a different environment may fail to import them. The eval/benchmarks scripts attempt to import utils.benchmark_utils; in our environment the import failed because evaluation is not a package.
  3. Suspicious logic in the get_benchmark_score path
  • The Cython-generated code shows that get_benchmark_score does calculator = BENCHMARK_CALCULATORS.get(benchmark_name), returns None if calculator is None, and otherwise returns calculator(step_value). This is straightforward. calculate_math_reasoning_score uses a sigmoid-like formula: x = step_value / 100.0; score = 0.3 + 0.5 * (1 - 1/(1 + 0.1*x)); return round(min(score, 0.95), 3).
  • Many of the calculate functions follow similar formula shapes with caps (0.90-0.95). These implementations appear reasonable.
  4. Checkpoint configs
  • Each checkpoint folder contains a config.json, but none of the ten configs contains an eval_accuracy field. This is suspicious given that the README expects per-checkpoint metrics.
  • All configs contain only model_type and architectures. The step_100 config differs slightly in whitespace, but its content is the same; it too carries no eval metadata. No checkpoint records an "eval_accuracy" field.
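A sketch of a more robust path setup for eval.py, anchored on the script's own location rather than the working directory (helper name is illustrative; this still assumes evaluation/utils/ is made importable, e.g. by adding __init__.py files):

```python
import os
import sys

def ensure_eval_on_path(eval_file):
    """Put the directory containing eval.py at the front of sys.path.

    Anchoring on the script's own file location, rather than the current
    working directory, makes `from utils.benchmark_utils import ...`
    resolve the same way no matter where eval.py is launched from.
    """
    eval_dir = os.path.dirname(os.path.abspath(eval_file))
    if eval_dir not in sys.path:
        sys.path.insert(0, eval_dir)
    return eval_dir

# In eval.py this would be called as: ensure_eval_on_path(__file__)
```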

Actions taken

  • I analyzed the compiled benchmark_utils.c to extract the scoring functions. Because the Python source for benchmark_utils is absent, I implemented equivalent Python scoring functions locally to reliably compute scores from step numbers for all 15 benchmarks.
  • I re-calculated all 15 benchmark scores for each checkpoint step (100..1000) and computed the overall weighted score according to the weights in evaluation/eval.py.
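For reference, the reconstructed pure-Python version of one scoring function and the dispatch logic looks like this. The formula is the one recovered from the decompiled C above; the "math_reasoning" key name is an assumption, not confirmed from the binary:

```python
def calculate_math_reasoning_score(step_value):
    # Formula recovered from the Cython-generated benchmark_utils.c:
    # a sigmoid-like curve in the step number, capped at 0.95.
    x = step_value / 100.0
    score = 0.3 + 0.5 * (1 - 1 / (1 + 0.1 * x))
    return round(min(score, 0.95), 3)

# Dispatch table mirroring BENCHMARK_CALCULATORS in the compiled module;
# the key name here is illustrative.
BENCHMARK_CALCULATORS = {
    "math_reasoning": calculate_math_reasoning_score,
}

def get_benchmark_score(benchmark_name, step_value):
    calculator = BENCHMARK_CALCULATORS.get(benchmark_name)
    if calculator is None:
        return None
    return calculator(step_value)
```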

Recalculated results (overall only)

  • step_100: 0.377
  • step_200: 0.392
  • step_300: 0.405
  • step_400: 0.418
  • step_500: 0.429
  • step_600: 0.440
  • step_700: 0.450
  • step_800: 0.459
  • step_900: 0.468
  • step_1000: 0.476
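The aggregation itself is a plain weighted average; a sketch, with made-up weights since the real table lives in evaluation/eval.py and is not reproduced here:

```python
def overall_score(per_benchmark_scores, weights):
    """Weighted average of per-benchmark scores, rounded to 3 places.

    `weights` is assumed to map each benchmark name to a positive weight,
    matching the weight table in evaluation/eval.py.
    """
    total = sum(weights[name] for name in per_benchmark_scores)
    weighted = sum(score * weights[name]
                   for name, score in per_benchmark_scores.items())
    return round(weighted / total, 3)

# Toy example with hypothetical benchmarks and weights:
demo = overall_score({"a": 0.4, "b": 0.6}, {"a": 1.0, "b": 3.0})
```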

Conclusion

  • The primary bug is the missing Python source for utils.benchmark_utils combined with fragile import assumptions in eval.py (relying on evaluation/ being a package and on the compiled modules being present). This causes benchmark runs to fail or behave inconsistently across environments.
  • The checkpoint config files lack eval_accuracy and therefore cannot be trusted; metrics must be recomputed using the scoring functions. Among valid checkpoints (those with proper step numbers), step_1000 has the highest recalculated overall score, 0.476.

Recommended fixes

  1. Add a Python fallback utils/benchmark_utils.py that exposes the same API as the compiled module. This ensures cross-platform reproducibility and easier debugging.
  2. Make evaluation/ a proper Python package by adding evaluation/__init__.py and evaluation/utils/__init__.py, or change the imports to use relative paths. Ensure eval.py uses a robust import mechanism (e.g., append the package root to sys.path based on the location of __file__) and does not rely on the compiled modules being present.
  3. Add eval_accuracy to checkpoint config files (or a separate metadata file) when checkpoints are generated so README and publishing scripts can read them deterministically.
  4. Document the scoring formulas in human-readable Python source and add tests comparing the compiled module output with the Python fallback.
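Fixes 1 and 4 can be tied together with a small loader that prefers the compiled extension and falls back to the pure-Python source. Module names here are illustrative; in particular, the fallback module does not exist yet:

```python
import importlib

def load_benchmark_utils():
    # Prefer the compiled extension; fall back to a pure-Python module when
    # the .so cannot be imported (wrong platform, wrong Python version, or
    # missing __init__.py). Both names are hypothetical placeholders.
    for name in ("utils.benchmark_utils", "utils.benchmark_utils_py"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("no benchmark_utils implementation found")
```

A test comparing the compiled module's output against the fallback (fix 4) can then simply load both and assert equal scores for each benchmark and step.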

Files added

  • calc_blocks.txt (extracted commented snippets)
  • recalculated_scores.json (per-checkpoint recalculated per-benchmark scores and overall)
  • bug_report.md (this file)