v1.4 20260324 1625 - Phase C frontier models: Claude-3.7-Sonnet (#1) + GPT-4o (#10)
- Claude-3.7-Sonnet: MEI=0.774 SR=56.7% HRR=87.5% d=0.554 (NEW #1)
- GPT-4o: MEI=0.315 SR=6.7% HRR=35.3% d=1.917 (NEW #10, below GPT-4o-mini)
- 10 models x 60 trials = 600 total; analysis_final2.json rebuilt k=10
- Update hallumaze_final.html leaderboard + abstract + findings
- Update paper draft Section 4: new results, F3 frontier inversion
- Add experiment_results/or_phaseC.json
v1.3.1 20260324 - MARL v2 GLM-4.7 extension + paper/docs update
- MARL v2 GLM-4.7: MEI=0.694 (+12.9% vs GLM baseline 0.615)
- Add run_marl5stage_v2_glm.py + marl5stage_v2_glm.json
- Paper draft: add Section 6.2 MARL experiment findings
- Update docs/hf_post.md, DOC_INDEX.md
v1.3 20260324 - MARL 5-stage experiment (v1 + v2 with fixes)
- MARL v1: naive 5-stage MEI=0.548 (-7.6% vs baseline)
- MARL v2: PathValidator+ConditionalPipeline+MazeReinjection MEI=0.803 (+35.4%)
- Add HF community post draft (docs/hf_post.md)
- Add research docs: fix-solutions, ecological-validity
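Note on the v1.3 percentages: the two MARL deltas are relative changes against a baseline MEI that this entry does not state. The sketch below back-computes the baseline implied by each reported figure as a consistency check. The function names are hypothetical and are not part of hallumaze.py; only the four numbers come from the entry above.

```python
def relative_change(new: float, baseline: float) -> float:
    """Percent change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


def implied_baseline(new: float, pct_change: float) -> float:
    """Baseline implied by a reported value and its reported percent change."""
    return new / (1.0 + pct_change / 100.0)


# Figures reported in the v1.3 entry (MARL v1 and MARL v2).
baseline_from_v1 = implied_baseline(0.548, -7.6)
baseline_from_v2 = implied_baseline(0.803, 35.4)

print(f"implied baseline from v1: {baseline_from_v1:.3f}")
print(f"implied baseline from v2: {baseline_from_v2:.3f}")
```

If both lines print nearly the same value, the two percentages were measured against a single shared baseline.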
v1.1 20260323 - Add core library, fix statistics, HF Space ready
- Add files/hallumaze.py (core benchmark library, 951 lines)
- Add requirements.txt
- Add index.html for HuggingFace Space (self-contained, HF YAML frontmatter)
- Fix anonymous GitHub URLs → jaytoone/HalluMaze
- Fix statistics: Glass's delta + one-sample Wilcoxon (corrected d=0.8-2.1)
- Add docs/research/ (paper draft + extension TODO analysis)
- Add experiment_results/ full dataset (20 JSON files)
- Fix README: pip install -r requirements.txt, reproducibility notes
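
Note on the v1.1 statistics fix: Glass's delta is the difference between two group means divided by the standard deviation of the control group alone, which is useful when the two groups have unequal variances. The following is a minimal sketch of that textbook definition, not the code in files/hallumaze.py. The function name and the sample scores are hypothetical.

```python
import statistics


def glass_delta(treatment: list[float], control: list[float]) -> float:
    """Glass's delta: mean difference scaled by the control group's sample SD."""
    mean_diff = statistics.mean(treatment) - statistics.mean(control)
    return mean_diff / statistics.stdev(control)  # sample SD (n - 1 denominator)


# Hypothetical per-trial scores, for illustration only.
control = [0.50, 0.55, 0.60, 0.65, 0.70]
treatment = [0.70, 0.75, 0.80, 0.85, 0.90]

delta = glass_delta(treatment, control)
print(f"Glass's delta = {delta:.3f}")
```

Only the control group's spread enters the denominator, so a treatment that also changes the variance does not distort the effect size.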