---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-8B
pipeline_tag: text-generation
library_name: transformers
tags:
- ablation-study
- scientific-reasoning
- post-training
- qwen3
---

# ABForge-Qwen3-8B-Task2

An **ABForge** model for **Task 2: Ablation Plan Generation**.

ABForge is a post-training pipeline for paper-grounded ablation design. This checkpoint is
post-trained with the full ABForge pipeline: supervised fine-tuning from `Qwen/Qwen3-8B` followed by rubric-guided GRPO (**SFT → GRPO**).

## Task

Given a paper's context and a goal, the model produces a detailed, controlled **ablation experiment design plan** (objective, setup, variants, fixed protocols and metrics).

## Training data

SFT on `train/sft_task2_37019.jsonl`, then GRPO on `train/RL_task2_30K.jsonl`, from [`SlowGuess/abforge-data`](https://huggingface.co/datasets/SlowGuess/abforge-data)
(derived from CC-licensed research papers). Evaluation uses the held-out **AblationBench** split
(`eval/ablationbench_200.jsonl`) of the same dataset.

## Related models (Task 2)

- [`SlowGuess/ABForge-Qwen3-8B-Task2`](https://huggingface.co/SlowGuess/ABForge-Qwen3-8B-Task2) (this model)
- [`SlowGuess/ABForge-Qwen3-8B-Task2-SFT`](https://huggingface.co/SlowGuess/ABForge-Qwen3-8B-Task2-SFT)
- [`SlowGuess/ABForge-Qwen3-8B-Task2-RL`](https://huggingface.co/SlowGuess/ABForge-Qwen3-8B-Task2-RL)

## Evaluation

Reproduce AblationBench evaluation with the [`SlowGuess/Abforge_1`](https://github.com/SlowGuess/Abforge_1) code:

```bash
git clone https://github.com/SlowGuess/Abforge_1 && cd Abforge_1
huggingface-cli download SlowGuess/abforge-data --repo-type dataset --local-dir data

export MODEL_PATH=SlowGuess/ABForge-Qwen3-8B-Task2

# 1. Generate predictions on AblationBench
python run_inference_local.py --task 2 \
  --input data/eval/ablationbench_200.jsonl \
  --output preds.jsonl \
  --model-path "$MODEL_PATH" --dtype bf16 --max-new-tokens 4096

# 2. Score against the fixed AblationBench rubric (Claude judge)
export ANTHROPIC_API_KEY=<your-key>
python scripts/eval_task2_claude_rubric_v2.py --input preds.jsonl --output scored.jsonl
```

## Links

- Dataset: [`SlowGuess/abforge-data`](https://huggingface.co/datasets/SlowGuess/abforge-data)
- Code: [`SlowGuess/Abforge_1`](https://github.com/SlowGuess/Abforge_1)

## Citation

```bibtex
@misc{abforge,
  title  = {ABForge: A Post-Training Pipeline for Paper-Grounded Ablation Design},
  author = {ABForge authors},
  year   = {2026},
  howpublished = {\url{https://github.com/SlowGuess/Abforge_1}}
}
```