# Self-Evolving Search Spaces – NeurIPS 2026 Anonymous Reproduction Package
This repository is the anonymous reproduction bundle for the NeurIPS 2026 submission "Self-Evolving Search Spaces: The $\eta^2$ Boundary Between Auto-Tuning and Auto-Research".
It contains the checkpoints, representative run logs, and precomputed analysis artifacts required to reproduce every headline number in the paper. Real-name hosting will be provided upon acceptance.
## Contents

```
checkpoints/
  champion-vjepa2-deploy/      # Deployed single-model champion (0.906 mAP_ALL on Nexar)
  search-best-vjepa2/          # LLM search-policy best (0.727 mAP on Nexar)
  best-non-vjepa2/             # Strongest non-V-JEPA 2 baseline (Table 3 cross-backbone)
  asr-8b-lora/                 # 8B LoRA adapter (5.30% WER, #1 on Open ASR Leaderboard)
experiments_sample/            # 150 Nexar + 50 ASR run logs sampled from ~3,190 + ~900
computed_values/               # JSON artifacts driving all figures and tables
  data/
    anova.json                 # Section 4 ANOVA decomposition
    convergence.json           # Figure 1 search-policy trajectories
    e2e_anova.json             # Section 6 E2E LoRA boundary ablation
    ablation.json              # Obfuscated-names ablation
    cost_efficiency_deep.json  # Appendix cost table
  deployable_analysis/         # Cross-checks for Section 5 (SMAC / LLM test-set results)
  oracle_correction.json       # Sample-size-matched comparison
  results/
    best_model.pt              # Full V-JEPA 2 champion (weights + head)
    soup_best.pt               # Model-soup variant used in ablation
```
## Reproducing the headline numbers

The commands below assume a standard `torch` + `huggingface_hub` + `peft` environment. See `requirements.txt` in the anonymous code repository (https://anonymous.4open.science/r/orze-anon) for pinned versions.
### Nexar deployed champion – 0.906 mAP_ALL (Section 5)

```bash
# Download the full V-JEPA 2 checkpoint + head
huggingface-cli download orze-ai/orze-nips-2026 \
    computed_values/results/best_model.pt \
    --repo-type model --local-dir ./ckpts

# Evaluate with 4-view TTA + CV-mix aggregation
python eval_e2e.py \
    --checkpoint ckpts/computed_values/results/best_model.pt \
    --tta 4 --aggregation cv_mix
# Expected: mAP_ALL = 0.906 (Public 0.920 / Private 0.893)
```
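If you want to sanity-check a reported mAP independently of `eval_e2e.py`, average precision can be recomputed from per-clip scores and binary labels. The sketch below is a minimal, dependency-free AP implementation for spot checks; the score/label lists are illustrative, and how predictions are exported from the eval script is an assumption, not part of this repo's documented interface.

```python
def average_precision(scores, labels):
    """AP: precision@k averaged over the ranks k at which positives occur.

    scores: per-clip confidence scores; labels: 1 for positive clips, else 0.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# Toy example: 3 positives among 5 clips, one ranked below a negative
print(average_precision([0.9, 0.8, 0.7, 0.6, 0.3], [1, 0, 1, 1, 0]))
```

mAP_ALL would then be the mean of this quantity over the evaluation classes/splits.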
### Nexar search-policy ceiling – 0.727 (Table 6)

```bash
huggingface-cli download orze-ai/orze-nips-2026 \
    checkpoints/search-best-vjepa2 --repo-type model --local-dir ./ckpts/search_best
python eval_e2e.py --checkpoint ckpts/search_best/best_model.pt --tta 1
# Expected: mAP ≈ 0.727
```
### Cross-backbone transfer (Table 3, $\Delta \geq 0.35$)

```bash
huggingface-cli download orze-ai/orze-nips-2026 checkpoints/best-non-vjepa2 \
    --repo-type model --local-dir ./ckpts/non_vjepa
python scripts/cross_backbone_transfer.py \
    --vjepa ckpts/search_best/best_model.pt \
    --alt ckpts/non_vjepa/best_model.pt \
    --output xfer.json
```
### ANOVA decomposition – $\eta^2_{\mathrm{arch}} \in [0.20, 0.51]$ (Section 4)

The ANOVA numbers are fully reproducible from the precomputed JSON without re-running any experiments:

```bash
python3 -c "
import json
d = json.load(open('computed_values/data/anova.json'))
print('Nexar eta^2_arch:', d['nexar']['eta_squared']['architecture'])
print('UCF-101 eta^2_arch:', d['ucf101']['eta_squared']['architecture'])
"
# Nexar eta^2_arch: 0.51 | UCF-101 eta^2_arch: 0.148
```
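For reference, the architecture effect size is the standard one-way $\eta^2$: the between-group sum of squares divided by the total sum of squares, grouping runs by architecture. A minimal sketch, assuming runs are available as (architecture, metric) pairs; the grouping key and the toy scores below are illustrative, not the repo's actual schema:

```python
from collections import defaultdict

def eta_squared(runs):
    """runs: list of (architecture, score) pairs -> SS_between / SS_total."""
    groups = defaultdict(list)
    for arch, score in runs:
        groups[arch].append(score)
    all_scores = [s for _, s in runs]
    grand = sum(all_scores) / len(all_scores)
    ss_total = sum((s - grand) ** 2 for s in all_scores)
    ss_between = sum(
        len(g) * ((sum(g) / len(g)) - grand) ** 2 for g in groups.values()
    )
    return ss_between / ss_total

# Toy runs: a large backbone gap dominates the variance, so eta^2 is near 1
runs = [("vjepa2", 0.70), ("vjepa2", 0.72), ("timesformer", 0.35), ("timesformer", 0.38)]
print(eta_squared(runs))
```

The full analysis also partitions variance across the other factors (learning rate, etc.), but each factor's $\eta^2$ follows this same SS-ratio form.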
To reproduce from scratch, run the analysis script from the anonymous code repo against the sampled (or full) experiment logs:

```bash
python analyze_predictions.py \
    --experiments experiments_sample/ \
    --output anova_from_sample.json
```
### E2E LoRA boundary ablation – $\eta^2_{\mathrm{arch}} \downarrow 0.12$ (Section 6)

```bash
python3 -c "
import json
d = json.load(open('computed_values/data/e2e_anova.json'))
print(d['eta_squared'])
# {'architecture': 0.12, 'learning_rate': 0.79, ...}
"
```
### ASR 8B LoRA – 5.30% WER (Section 7)

```bash
huggingface-cli download orze-ai/orze-nips-2026 checkpoints/asr-8b-lora \
    --repo-type model --local-dir ./ckpts/asr_8b

# Merge the adapter onto the base model and evaluate on the Open ASR test bundle
python asr_eval.py --adapter ckpts/asr_8b/ --benchmark open_asr
# Expected WER: 5.30%
```
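WER is the word-level edit distance (substitutions + insertions + deletions) divided by the reference length. The sketch below is a minimal implementation for spot-checking individual transcripts; `asr_eval.py`'s exact text normalization (casing, punctuation) is not specified here and may differ.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```

The benchmark-level 5.30% figure is the corpus WER (total errors over total reference words), not a mean of per-utterance WERs.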
## Experiment log sample

`experiments_sample/` contains 150 Nexar + 50 ASR runs sampled uniformly at
random (seed 20260421) from the full campaign. Each run directory contains:

- `idea_config.yaml` – the configuration proposed by the agent
- `metrics.json` – the resulting test-set metrics
- `claim.json` – the agent's rationale (where logged)
This sample is sufficient to independently re-run the ANOVA script and verify the variance decomposition to within sampling error. The complete ~10,000-run campaign log will be released upon acceptance.
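The per-run layout can be consumed with the standard library alone (plus PyYAML if you also want `idea_config.yaml`). A minimal sketch that collects `metrics.json` from each run directory, demonstrated on a synthetic directory standing in for `experiments_sample/`; the `"map"` metric key is an assumption about the schema, not documented here:

```python
import json
import tempfile
from pathlib import Path

def load_runs(root: Path):
    """Collect (run_name, metrics_dict) pairs from run dirs containing metrics.json."""
    runs = []
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        metrics_path = run_dir / "metrics.json"
        if metrics_path.is_file():
            runs.append((run_dir.name, json.loads(metrics_path.read_text())))
    return runs

# Synthetic fixture in place of experiments_sample/ ("map" key is hypothetical)
root = Path(tempfile.mkdtemp())
(root / "run_0001").mkdir()
(root / "run_0001" / "metrics.json").write_text('{"map": 0.71}')
print(load_runs(root))  # [('run_0001', {'map': 0.71})]
```

The resulting (run, metrics) pairs are the natural input for re-running the variance decomposition on the sample.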
## Citation
The paper is under double-blind review; citation metadata will be added upon acceptance.
## License
Apache 2.0. Third-party model weights (V-JEPA 2, Llama-3.1-8B for ASR) remain subject to their original licenses.