SlowGuess's picture
Upload ABForge model
0c4b0c1 verified
|
Raw
History Blame Contribute Delete
2.56 kB
---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-8B
pipeline_tag: text-generation
library_name: transformers
tags:
- ablation-study
- scientific-reasoning
- post-training
- qwen3
---
# ABForge-Qwen3-8B-Task2
An **ABForge** model for **Task 2: Ablation Plan Generation**.
ABForge is a post-training pipeline for paper-grounded ablation design. This checkpoint is
post-trained with the full ABForge pipeline: supervised fine-tuning from `Qwen/Qwen3-8B` followed by rubric-guided GRPO (**SFT → GRPO**).
## Task
Given a paper's context and a goal, the model produces a detailed, controlled **ablation experiment design plan** (objective, setup, variants, fixed protocols and metrics).
## Training data
SFT on `train/sft_task2_37019.jsonl`, then GRPO on `train/RL_task2_30K.jsonl`, from [`SlowGuess/abforge-data`](https://huggingface.co/datasets/SlowGuess/abforge-data)
(derived from CC-licensed research papers). Evaluation uses the held-out **AblationBench** split
(`eval/ablationbench_200.jsonl`) of the same dataset.
## Related models (Task 2)
- [`SlowGuess/ABForge-Qwen3-8B-Task2`](https://huggingface.co/SlowGuess/ABForge-Qwen3-8B-Task2) (this model)
- [`SlowGuess/ABForge-Qwen3-8B-Task2-SFT`](https://huggingface.co/SlowGuess/ABForge-Qwen3-8B-Task2-SFT)
- [`SlowGuess/ABForge-Qwen3-8B-Task2-RL`](https://huggingface.co/SlowGuess/ABForge-Qwen3-8B-Task2-RL)
## Evaluation
Reproduce AblationBench evaluation with the [`SlowGuess/Abforge_1`](https://github.com/SlowGuess/Abforge_1) code:
```bash
git clone https://github.com/SlowGuess/Abforge_1 && cd Abforge_1
huggingface-cli download SlowGuess/abforge-data --repo-type dataset --local-dir data
export MODEL_PATH=SlowGuess/ABForge-Qwen3-8B-Task2
# 1. Generate predictions on AblationBench
python run_inference_local.py --task 2 \
--input data/eval/ablationbench_200.jsonl \
--output preds.jsonl \
--model-path "$MODEL_PATH" --dtype bf16 --max-new-tokens 4096
# 2. Score against the fixed AblationBench rubric (Claude judge)
export ANTHROPIC_API_KEY=<your-key>
python scripts/eval_task2_claude_rubric_v2.py --input preds.jsonl --output scored.jsonl
```
## Links
- Dataset: [`SlowGuess/abforge-data`](https://huggingface.co/datasets/SlowGuess/abforge-data)
- Code: [`SlowGuess/Abforge_1`](https://github.com/SlowGuess/Abforge_1)
## Citation
```bibtex
@misc{abforge,
title = {ABForge: A Post-Training Pipeline for Paper-Grounded Ablation Design},
author = {ABForge authors},
year = {2026},
howpublished = {\url{https://github.com/SlowGuess/Abforge_1}}
}
```