Self-Evolving Search Spaces — NeurIPS 2026 Anonymous Reproduction Package

This repository is the anonymous reproduction bundle for the NeurIPS 2026 submission "Self-Evolving Search Spaces: The $\eta^2$ Boundary Between Auto-Tuning and Auto-Research".

It contains the checkpoints, representative run logs, and precomputed analysis artifacts required to reproduce every headline number in the paper. De-anonymized hosting will be provided upon acceptance.

Contents

checkpoints/
  champion-vjepa2-deploy/       # Deployed single-model champion (0.906 mAP_ALL on Nexar)
  search-best-vjepa2/           # LLM search-policy best (0.727 mAP on Nexar)
  best-non-vjepa2/              # Strongest non-V-JEPA 2 baseline (Table 3 cross-backbone)
  asr-8b-lora/                  # 8B LoRA adapter (5.30% WER, #1 on Open ASR Leaderboard)
experiments_sample/             # 150 Nexar + 50 ASR run logs sampled from ~3,190 + ~900
computed_values/                # JSON artifacts driving all figures and tables
  data/
    anova.json                  # Section 4 ANOVA decomposition
    convergence.json            # Figure 1 search-policy trajectories
    e2e_anova.json              # Section 6 E2E LoRA boundary ablation
    ablation.json               # Obfuscated-names ablation
    cost_efficiency_deep.json   # Appendix cost table
    deployable_analysis/        # Cross-checks for Section 5 (SMAC / LLM test-set results)
    oracle_correction.json      # Sample-size-matched comparison
  results/
    best_model.pt               # Full V-JEPA 2 champion (weights + head)
    soup_best.pt                # Model-soup variant used in ablation
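
Before running anything, it may help to sanity-check that the download completed. Below is a minimal sketch, with the expected paths copied from the directory tree above; the helper name is ours, not part of the bundle:

```python
from pathlib import Path

# Expected artifacts, taken from the directory tree above.
EXPECTED = [
    "checkpoints/champion-vjepa2-deploy",
    "checkpoints/search-best-vjepa2",
    "checkpoints/best-non-vjepa2",
    "checkpoints/asr-8b-lora",
    "experiments_sample",
    "computed_values/data/anova.json",
    "computed_values/data/e2e_anova.json",
    "computed_values/results/best_model.pt",
]

def missing_artifacts(root, expected=EXPECTED):
    """Return the expected paths that are absent under `root`."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]

if __name__ == "__main__":
    gone = missing_artifacts(".")
    print("bundle complete" if not gone else f"missing: {gone}")
```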

Reproducing the headline numbers

The commands below assume a standard torch + huggingface_hub + peft environment. See requirements.txt in the anonymous code repository (https://anonymous.4open.science/r/orze-anon) for pinned versions.

Nexar deployed champion — 0.906 mAP_ALL (Section 5)

# Download full V-JEPA 2 checkpoint + head
huggingface-cli download orze-ai/orze-nips-2026 \
    computed_values/results/best_model.pt \
    --repo-type model --local-dir ./ckpts

# Evaluate with 4-view TTA + CV-mix aggregation
python eval_e2e.py \
    --checkpoint ckpts/computed_values/results/best_model.pt \
    --tta 4 --aggregation cv_mix
# Expected: mAP_ALL = 0.906 (Public 0.920 / Private 0.893)
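
The exact 4-view TTA and CV-mix logic lives in eval_e2e.py; as a rough sketch of the presumed aggregation order (per-clip mean over TTA views, then mean over cross-validation fold models; an assumption, not confirmed by the script itself):

```python
from statistics import mean

def aggregate_scores(scores_by_fold):
    """Aggregate one clip's scores: mean over TTA views within each fold,
    then mean over CV folds (the presumed 'cv_mix' step).

    scores_by_fold: list (one entry per CV fold) of per-view score lists.
    """
    per_fold = [mean(views) for views in scores_by_fold]  # TTA average per fold
    return mean(per_fold)                                 # CV-mix average

# Example: 2 folds, 4 TTA views each
print(aggregate_scores([[0.90, 0.92, 0.88, 0.90],
                        [0.85, 0.87, 0.86, 0.90]]))
```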

Nexar search-policy ceiling — 0.727 (Table 6)

huggingface-cli download orze-ai/orze-nips-2026 \
    checkpoints/search-best-vjepa2 --repo-type model --local-dir ./ckpts/search_best
python eval_e2e.py --checkpoint ckpts/search_best/best_model.pt --tta 1
# Expected: mAP β‰ˆ 0.727

Cross-backbone transfer (Table 3, $\Delta \geq 0.35$)

huggingface-cli download orze-ai/orze-nips-2026 checkpoints/best-non-vjepa2 \
    --repo-type model --local-dir ./ckpts/non_vjepa
python scripts/cross_backbone_transfer.py \
    --vjepa ckpts/search_best/best_model.pt \
    --alt ckpts/non_vjepa/best_model.pt \
    --output xfer.json

ANOVA decomposition — $\eta^2_{\mathrm{arch}} \in [0.20, 0.51]$ (Section 4)

The ANOVA numbers are fully reproducible from the precomputed JSON without re-running any experiments:

python3 -c "
import json
d = json.load(open('computed_values/data/anova.json'))
print('Nexar eta^2_arch:', d['nexar']['eta_squared']['architecture'])
print('UCF-101 eta^2_arch:', d['ucf101']['eta_squared']['architecture'])
"
# Nexar eta^2_arch: 0.51  |  UCF-101 eta^2_arch: 0.148

To reproduce from scratch, run the analysis script from the anonymous code repo against the sampled (or full) experiment logs:

python analyze_predictions.py \
    --experiments experiments_sample/ \
    --output anova_from_sample.json
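
For readers who want to check what analyze_predictions.py is reporting, eta-squared for a single factor is just the between-group share of total variance. A self-contained one-way sketch follows (the paper's multi-factor decomposition is more involved; this only illustrates the quantity):

```python
from collections import defaultdict

def eta_squared(factor_levels, scores):
    """One-way ANOVA effect size: eta^2 = SS_between / SS_total.

    factor_levels: one factor value per run (e.g. architecture name)
    scores: the metric per run (e.g. mAP), aligned with factor_levels
    """
    grand = sum(scores) / len(scores)
    groups = defaultdict(list)
    for lvl, y in zip(factor_levels, scores):
        groups[lvl].append(y)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups.values())
    ss_total = sum((y - grand) ** 2 for y in scores)
    return ss_between / ss_total

# Toy example: architecture explains almost all variance here
archs  = ["vjepa2", "vjepa2", "other", "other"]
scores = [0.72, 0.70, 0.40, 0.42]
print(eta_squared(archs, scores))
```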

E2E LoRA boundary ablation — $\eta^2_{\mathrm{arch}}$ falls to $0.12$ (Section 6)

python3 -c "
import json
d = json.load(open('computed_values/data/e2e_anova.json'))
print(d['eta_squared'])
# {'architecture': 0.12, 'learning_rate': 0.79, ...}
"

ASR 8B LoRA — 5.30% WER (Section 7)

huggingface-cli download orze-ai/orze-nips-2026 checkpoints/asr-8b-lora \
    --repo-type model --local-dir ./ckpts/asr_8b
# Merge the adapter into the base model and evaluate on the Open ASR test bundle
python asr_eval.py --adapter ckpts/asr_8b/ --benchmark open_asr
# Expected WER: 5.30%
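
The benchmark harness computes WER internally; for a quick independent check of any transcript pair, a standard word-level Levenshtein sketch can be used (note: it does not reproduce the leaderboard's text normalization):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic programming over the edit-distance table, one row at a time
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 sub / 6 words
```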

Experiment log sample

experiments_sample/ contains 150 Nexar + 50 ASR runs sampled uniformly at random (seed 20260421) from the full campaign. Each run directory contains:

  • idea_config.yaml — the configuration proposed by the agent
  • metrics.json — the resulting test-set metrics
  • claim.json — the agent's rationale (where logged)

This sample is sufficient to independently re-run the ANOVA script and verify the variance decomposition to within sampling error. The complete ~10,000-run campaign log will be released upon acceptance.
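
To poke at the sample without the analysis script, the run logs can be aggregated with a few lines of stdlib Python. This assumes run directories sit directly under experiments_sample/ and that metrics.json is flat JSON; the "map" key is illustrative, so substitute whatever keys the bundled files actually use:

```python
import json
from pathlib import Path

def collect_metrics(sample_root, metric_key="map"):
    """Gather one metric across run directories.

    `metric_key` ("map" here) is a placeholder, not a confirmed field name.
    """
    values = {}
    for metrics_file in sorted(Path(sample_root).glob("*/metrics.json")):
        with open(metrics_file) as f:
            run_metrics = json.load(f)
        if metric_key in run_metrics:
            values[metrics_file.parent.name] = run_metrics[metric_key]
    return values

if __name__ == "__main__":
    runs = collect_metrics("experiments_sample")
    print(f"{len(runs)} runs with a 'map' entry")
```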

Citation

The paper is under double-blind review; citation metadata will be added upon acceptance.

License

Apache 2.0. Third-party model weights (V-JEPA 2, Llama-3.1-8B for ASR) remain subject to their original licenses.
