pythia-160m-c4

Pythia-160M trained from scratch on C4, with no pre-pretraining stage. This is the baseline model for the experiment.

Part of an experiment reproducing and extending the pruning analysis from "Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases" (Hu et al., 2025, arXiv:2502.19249).

Hypothesis under test: pre-pretraining on English itself yields attention-head circuits that are as sparse and as transferable as those obtained by pre-pretraining on k-shuffle Dyck.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gizemyc/pythia-160m-c4")
tokenizer = AutoTokenizer.from_pretrained("gizemyc/pythia-160m-c4")
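
Once loaded, the model can be used for text generation like any other causal LM. The following is a minimal sketch, assuming PyTorch is installed and that the repository ships a tokenizer with an EOS token (as GPT-NeoX-style tokenizers normally do). The helper name `generate_continuation` and its defaults are illustrative and not part of this model card.

```python
def generate_continuation(prompt, max_new_tokens=20, repo_id="gizemyc/pythia-160m-c4"):
    """Greedily decode a short continuation of `prompt` (hypothetical helper)."""
    # Imported inside the function so that defining the helper stays cheap.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding, so repeated runs give the same text
            # Assumption: the tokenizer defines an EOS token; reuse it for padding.
            pad_token_id=tokenizer.eos_token_id,
        )
    # Returns the prompt followed by the generated tokens, decoded to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_continuation("The weather today is")` downloads the checkpoint on first use. Since this is a 160M-parameter baseline, the continuations are mainly useful as a sanity check that loading works.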

Training

Base architecture: EleutherAI/pythia-160m (12 layers x 12 heads)
Pretraining data: C4 (English)
Pre-pretraining: none for this model (other models in the experiment are pre-pretrained on k-shuffle Dyck or English/C4)
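
For readers unfamiliar with k-shuffle Dyck: it interleaves k independent bracket types, where each type must be balanced on its own but different types may cross rather than nest. The sketch below illustrates that definition with a toy sampler and a membership check. It is not the data-generation code used in the paper or in this experiment; the function names, the token format (`<i` opens type i, `i>` closes it), and the `p_open` parameter are all assumptions made for illustration.

```python
import random


def sample_k_shuffle_dyck(k, n_pairs, p_open=0.5, rng=None):
    """Sample a k-shuffle Dyck string with exactly n_pairs bracket pairs."""
    rng = rng or random.Random()
    open_counts = [0] * k  # currently unclosed brackets, per type
    opens_left = n_pairs   # how many opening brackets may still be emitted
    tokens = []
    while opens_left > 0 or sum(open_counts) > 0:
        can_close = sum(open_counts) > 0
        if opens_left > 0 and (not can_close or rng.random() < p_open):
            t = rng.randrange(k)
            open_counts[t] += 1
            opens_left -= 1
            tokens.append(f"<{t}")
        else:
            # Close any type that has an open bracket; types need not nest.
            t = rng.choice([i for i, c in enumerate(open_counts) if c > 0])
            open_counts[t] -= 1
            tokens.append(f"{t}>")
    return tokens


def is_k_shuffle_dyck(tokens, k):
    """Check that every bracket type is balanced independently of the others."""
    depth = [0] * k
    for tok in tokens:
        is_open = tok.startswith("<")
        t = int(tok[1:]) if is_open else int(tok[:-1])
        if not 0 <= t < k:
            return False
        depth[t] += 1 if is_open else -1
        if depth[t] < 0:  # a close appeared before its matching open
            return False
    return all(d == 0 for d in depth)
```

For example, `["<0", "<1", "0>", "1>"]` is accepted even though the two bracket types cross, which is what distinguishes k-shuffle Dyck from ordinary nested Dyck languages.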

Model size: 0.2B params
Tensor type: BF16 (Safetensors)


Paper: Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases (arXiv:2502.19249, published May 27, 2025)