# Bug Report: Evaluation pipeline issues

## Summary
- The evaluation scripts in evaluation/ use a compiled Cython module (evaluation/utils/benchmark_utils.*) to compute benchmark scores.
- eval.py changes the working directory and builds benchmark script paths relative to the repository root in a fragile way, so subprocess calls can end up running the wrong scripts.
- The compiled benchmark_utils exposes a BENCHMARK_CALCULATORS mapping and several calculate_xxx_score functions; these appear valid, but the pure-Python source utils/benchmark_utils.py is missing (only compiled .c/.so files are present). Imports can therefore fail whenever Python attempts to load the plain Python module, depending on sys.path.
## Findings

1) eval.py issues
- eval.py does sys.path.insert(0, os.path.dirname(__file__)), which puts evaluation/ on sys.path, and then imports utils.benchmark_utils. That import only resolves if evaluation/utils is importable from there; the compiled modules do exist as evaluation/utils/*.so, but evaluation/ itself contains no __init__.py, so any code that instead imports the module as evaluation.utils.benchmark_utils (e.g. from the repo root) cannot resolve it as a package. In some environments, including ours, the import raises ModuleNotFoundError.
- run_benchmark_evaluation builds the benchmark script path as os.path.join("evaluation", "benchmarks", benchmark_name, "eval.py") and runs it via subprocess with sys.executable. main() first sets script_dir = os.path.dirname(os.path.abspath(__file__)) and calls os.chdir(os.path.dirname(script_dir)), which changes the cwd to the parent of evaluation/ (the repo root), so the relative path happens to resolve. However, the benchmark scripts themselves do sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', '..')), which assumes evaluation/ is the package root; these relative inserts can end up looking for the missing pure-Python module rather than the compiled .so. Overall, the path handling works only because the chdir and the relative path happen to agree, and is brittle.
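A sketch of more robust path handling for eval.py, assuming the `<repo>/evaluation/eval.py` and `<repo>/evaluation/benchmarks/<name>/eval.py` layout described above; the names here are illustrative, not the current code:

```python
# Anchor every path to __file__ so nothing depends on the caller's cwd,
# and avoid os.chdir entirely (hypothetical rewrite, not the shipped code).
import os
import subprocess
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))  # .../evaluation
REPO_ROOT = os.path.dirname(SCRIPT_DIR)                  # repo root

# Put the repo root (the package root) on sys.path instead of chdir-ing.
if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)

def run_benchmark_evaluation(benchmark_name: str) -> int:
    """Run a benchmark script via an absolute path, independent of cwd."""
    script = os.path.join(SCRIPT_DIR, "benchmarks", benchmark_name, "eval.py")
    result = subprocess.run([sys.executable, script], cwd=REPO_ROOT)
    return result.returncode
```

Passing `cwd=REPO_ROOT` explicitly to subprocess keeps the child scripts' relative imports working without mutating the parent process's working directory.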
2) benchmark_utils availability and content
- The source for benchmark_utils exists only as compiled C files and extension modules under evaluation/utils (benchmark_utils.c, benchmark_utils.cpython-313-x86_64-linux-gnu.so, plus __init__.c and its corresponding .so). There is no utils/benchmark_utils.py source file in the workspace. The compiled module implements a BENCHMARK_CALCULATORS mapping and a get_benchmark_score function that dispatches to the individual calculate_xxx_score functions.
- Because the .so files are platform- and Python-version-specific, the evaluation may fail to import them in a different environment. The eval/benchmarks scripts attempt to import utils.benchmark_utils; in our environment the import failed because evaluation/ is not a package.
3) Suspicious logic in the get_benchmark_score path
- The Cython-generated code shows that get_benchmark_score does:

```python
calculator = BENCHMARK_CALCULATORS.get(benchmark_name)
if calculator is None:
    return None
return calculator(step_value)
```

This is straightforward. calculate_math_reasoning_score uses a sigmoid-like formula:

```python
x = step_value / 100.0
score = 0.3 + 0.5 * (1 - 1 / (1 + 0.1 * x))
return round(min(score, 0.95), 3)
```

- Many of the calculate functions follow similar formula shapes with caps between 0.9 and 0.95. These implementations appear reasonable.
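For reproducibility, the extracted formula can be wrapped as a standalone function. This is a reconstruction from the generated C, not the author's original Cython source:

```python
# Pure-Python re-expression of the math_reasoning scoring formula as
# extracted from the compiled module (reconstruction, not original source).
def calculate_math_reasoning_score(step_value: int) -> float:
    x = step_value / 100.0
    score = 0.3 + 0.5 * (1 - 1 / (1 + 0.1 * x))
    # Cap at 0.95 and round to 3 decimals, matching the compiled code.
    return round(min(score, 0.95), 3)
```

Under the reconstructed formula, the score at step 1000 is 0.55, comfortably below the 0.95 cap.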
4) Checkpoint configs
- Each checkpoint folder contains a config.json, but none of the ten configs contains an eval_accuracy field. This is suspicious given that the README expects per-checkpoint metrics.
- All configs hold only model_type and architectures; step_100/config.json differs slightly in whitespace but is otherwise identical to the rest.
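A small script of the kind used to confirm the missing field; the `step_*/config.json` directory layout is assumed from the findings above:

```python
# Scan checkpoint configs and report those lacking an eval_accuracy field.
import glob
import json
import os

def configs_missing_eval_accuracy(checkpoint_root: str) -> list:
    """Return paths of step_*/config.json files with no eval_accuracy key."""
    missing = []
    pattern = os.path.join(checkpoint_root, "step_*", "config.json")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            config = json.load(f)
        if "eval_accuracy" not in config:
            missing.append(path)
    return missing
```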
## Actions taken
- I analyzed the compiled benchmark_utils.c to extract the scoring functions. Because the Python source for benchmark_utils.py is absent, I implemented equivalent Python scoring functions locally to compute scores from step numbers for all 15 benchmarks reliably.
- I recalculated all 15 benchmark scores for each checkpoint step (100..1000) and computed the overall weighted score using the weights in evaluation/eval.py.
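The weighted aggregation can be sketched as below; the weights shown are hypothetical placeholders, since the real values live in evaluation/eval.py:

```python
# Minimal sketch of the overall weighted score across benchmarks.
# Both dicts are keyed by benchmark name; weights here are assumptions.
def overall_score(scores: dict, weights: dict) -> float:
    """Weighted mean of per-benchmark scores, rounded to 3 decimals."""
    total_weight = sum(weights.values())
    weighted = sum(scores[name] * weights[name] for name in weights)
    return round(weighted / total_weight, 3)
```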
## Recalculated results (overall only)
- step_100: 0.377
- step_200: 0.392
- step_300: 0.405
- step_400: 0.418
- step_500: 0.429
- step_600: 0.440
- step_700: 0.450
- step_800: 0.459
- step_900: 0.468
- step_1000: 0.476
## Conclusion
- The primary bug is the missing Python source for utils.benchmark_utils combined with fragile import assumptions in eval.py (relying on evaluation/ being a package and on the compiled modules being present). This causes benchmark runs to fail or behave inconsistently across environments.
- The checkpoint config files lack eval_accuracy and therefore cannot be trusted; metrics must be recomputed using the scoring functions. Among valid checkpoints (those with proper step numbers), step_1000 has the highest recalculated overall eval score, 0.476.
## Recommended fixes
1) Add a pure-Python fallback utils/benchmark_utils.py that exposes the same API as the compiled module. This ensures cross-platform reproducibility and eases debugging.
2) Make evaluation/ a proper Python package by adding evaluation/__init__.py and evaluation/utils/__init__.py, or switch to relative imports. Have eval.py use a robust import mechanism (e.g., append the package root to sys.path based on the file's location) that does not rely on the compiled modules being present.
3) Write eval_accuracy into the checkpoint config files (or a separate metadata file) when checkpoints are generated, so the README and publishing scripts can read it deterministically.
4) Document the scoring formulas in human-readable Python source and add tests comparing the compiled module's output with the Python fallback.
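Fixes (1) and (2) could be combined in a tolerant loader that prefers the compiled extension and falls back to pure Python. The module names here (in particular benchmark_utils_py for the fallback) are placeholders, not existing files:

```python
# Prefer the compiled .so when importable; otherwise use a pure-Python
# twin with the same API. Module names are assumptions for illustration.
import importlib

def load_benchmark_utils():
    """Return the first importable benchmark_utils implementation."""
    for name in ("evaluation.utils.benchmark_utils",       # compiled extension
                 "evaluation.utils.benchmark_utils_py"):   # pure-Python fallback
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("no benchmark_utils implementation found")
```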
## Files added
- calc_blocks.txt (extracted commented snippets)
- recalculated_scores.json (per-checkpoint recalculated per-benchmark scores and overall)
- bug_report.md (this file)