pythia-160m-c4

Pythia-160M trained on C4 from scratch, with no pre-pretraining. This is the baseline model for the experiment described below.

Part of an experiment reproducing and extending the pruning analysis from "Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases" (Hu et al., 2025, arXiv:2502.19249).

The hypothesis under test: pre-pretraining on English itself yields attention-head circuits that are as sparse and as transferable as those obtained by pre-pretraining on k-shuffle Dyck.
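To make "sparse" and "transferable" concrete, here is a minimal sketch of how the two properties could be scored for head-level circuits. It assumes a circuit is represented as the set of (layer, head) pairs kept after pruning, on the 12 x 12 head grid listed under Training. The Jaccard overlap as a proxy for transferability, the helper names, and the example circuits are all illustrative assumptions; this card does not specify the metrics actually used in the experiment or in the paper.

```python
# Assumption: 12 layers x 12 heads, as listed for EleutherAI/pythia-160m.
N_LAYERS, N_HEADS = 12, 12
TOTAL_HEADS = N_LAYERS * N_HEADS  # 144 attention heads in total


def sparsity(kept: set[tuple[int, int]]) -> float:
    """Fraction of all attention heads that were pruned away (not kept)."""
    return 1.0 - len(kept) / TOTAL_HEADS


def overlap(kept_a: set[tuple[int, int]], kept_b: set[tuple[int, int]]) -> float:
    """Jaccard overlap of two circuits (hypothetical transferability proxy)."""
    union = kept_a | kept_b
    if not union:
        return 1.0  # two empty circuits are trivially identical
    return len(kept_a & kept_b) / len(union)


# Hypothetical circuits, invented purely for illustration (not real results).
circuit_a = {(0, 1), (3, 7), (5, 2), (11, 0)}
circuit_b = {(0, 1), (5, 2), (8, 4)}

print(f"sparsity of A: {sparsity(circuit_a):.3f}")
print(f"overlap of A and B: {overlap(circuit_a, circuit_b):.2f}")
```

A higher sparsity means fewer heads are needed, and a higher overlap means two circuits reuse more of the same heads.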

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gizemyc/pythia-160m-c4")
tokenizer = AutoTokenizer.from_pretrained("gizemyc/pythia-160m-c4")
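
Once the model and tokenizer load, a quick sanity check is to generate a short continuation. The sketch below wraps the loading calls shown above in a hypothetical helper, `generate_text`, and uses greedy decoding. The helper name, the default of 20 new tokens, and the use of the EOS token as padding are assumptions made for illustration, not settings taken from this card.

```python
REPO_ID = "gizemyc/pythia-160m-c4"


def generate_text(prompt: str, max_new_tokens: int = 20) -> str:
    """Greedy-decode a continuation of `prompt` with the baseline model."""
    # Imported here so the heavy dependencies load only when generation runs.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding, so the output is deterministic
        # Assumption: reuse EOS as the padding token to avoid a generate() warning.
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("The weather today is")` downloads the weights on first use and returns the prompt followed by its continuation. A 160M-parameter baseline may well produce repetitive or incoherent text, so this is a check that loading works rather than a quality benchmark.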

Training

  • Base architecture: EleutherAI/pythia-160m (12 layers x 12 heads)
  • Pretraining data: C4 (English)
  • Pre-pretraining: none for this baseline; where applicable, other models in the experiment use k-shuffle Dyck or English/C4
  • Weights: Safetensors, BF16, 0.2B params