Pythia-160M trained from scratch on C4, with no pre-pretraining. This model is the baseline.
Part of an experiment reproducing and extending the pruning analysis from "Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases" (Hu et al., 2025, arXiv:2502.19249).
The hypothesis under test: pre-pretraining on English itself yields attention-head circuits that are as sparse and as transferable as those obtained by pre-pretraining on k-shuffle Dyck.
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the baseline checkpoint and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("gizemyc/pythia-160m-c4")
tokenizer = AutoTokenizer.from_pretrained("gizemyc/pythia-160m-c4")
Base model: EleutherAI/pythia-160m (12 layers x 12 heads)
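To make the 12 x 12 head grid and the notion of a sparse head circuit concrete, here is a minimal sketch in plain Python. It is not the pruning procedure from Hu et al. or from this experiment: the masks are drawn at random as placeholders, and the helper names `random_head_mask`, `sparsity`, and `head_overlap` are hypothetical. It only shows how a head-level mask over this architecture might be represented, how its sparsity could be measured, and one possible way (Jaccard overlap, an assumption) to compare which heads two masks keep.

```python
import random

NUM_LAYERS = 12  # from the model card: 12 layers
NUM_HEADS = 12   # from the model card: 12 heads per layer


def random_head_mask(keep_fraction, seed=0):
    """Hypothetical placeholder: keep a random subset of heads.

    Returns a NUM_LAYERS x NUM_HEADS list of 0/1 entries (1 = head kept).
    A real pruning analysis would learn or search for this mask instead.
    """
    rng = random.Random(seed)
    total = NUM_LAYERS * NUM_HEADS
    n_keep = round(keep_fraction * total)
    kept = set(rng.sample(range(total), n_keep))
    return [
        [1 if layer * NUM_HEADS + head in kept else 0 for head in range(NUM_HEADS)]
        for layer in range(NUM_LAYERS)
    ]


def sparsity(mask):
    """Fraction of heads that are pruned (mask entry 0)."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1 - kept / total


def head_overlap(mask_a, mask_b):
    """Jaccard overlap between the sets of heads kept by two masks."""
    kept_a = {(l, h) for l, row in enumerate(mask_a) for h, v in enumerate(row) if v}
    kept_b = {(l, h) for l, row in enumerate(mask_b) for h, v in enumerate(row) if v}
    union = kept_a | kept_b
    return len(kept_a & kept_b) / len(union) if union else 1.0


mask_a = random_head_mask(keep_fraction=0.25, seed=0)
mask_b = random_head_mask(keep_fraction=0.25, seed=1)
print("heads in model:", NUM_LAYERS * NUM_HEADS)
print("sparsity of mask_a:", sparsity(mask_a))
print("overlap between mask_a and mask_b:", head_overlap(mask_a, mask_b))
```

With 144 heads in total, keeping a quarter of them leaves 36 active heads. Comparing the heads kept for one task against those kept for another is one simple proxy for how transferable a circuit is; other measures are equally plausible.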