# Model connectomes: A generational approach to data-efficient language models  
_Second Workshop on Representational Alignment at ICLR 2025_  

**By:** Klemen Kotar & Greta Tuckute

---

![Paper Figure](generational_connectome_fig.png)

---

## Released Models

We have released the following pretrained Generational Connectome GPT models on the Hugging Face Hub:

| Model | Description |
|-------|-------------|
| [TuKoResearch/ConnectomeGPT100M](https://huggingface.co/TuKoResearch/ConnectomeGPT100M/) | Generational Pruning GPT with learned connectome |
| [TuKoResearch/RandomConnectomeGPT100M](https://huggingface.co/TuKoResearch/RandomConnectomeGPT100M/) | Generational Pruning GPT with random connectome |
| [TuKoResearch/NoConnectomeGPT100M](https://huggingface.co/TuKoResearch/NoConnectomeGPT100M/) | Generational Pruning GPT without any connectome |

You can evaluate any of these models on downstream NLP benchmarks by specifying the `--model_name` flag in the evaluation scripts.

---

## Installation

1. **Clone the repo**  
   ```bash
   git clone https://github.com/TuKoResearch/GenerationalConnectomes.git
   cd GenerationalConnectomes
   ```

2. **Create & activate a Conda environment**  
   ```bash
   conda create -n GenerationalConnectomes python=3.11 -y
   conda activate GenerationalConnectomes
   ```

3. **Install PyTorch 2.6** (with the appropriate CUDA toolkit for your setup)  
   ```bash
   conda install -c pytorch pytorch==2.6.0 torchvision torchaudio cudatoolkit=11.7 -y
   ```

4. **Install the remaining dependencies**  
   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```


---

## NLP Evaluations

We provide evaluation scripts for MMLU and HellaSwag in `evals/`.
You can reproduce our evaluations using the model checkpoints from the Hugging Face Hub:

1. **Run MMLU**:
   ```bash
   python evals/mmlu.py \
     --model_name TuKoResearch/ConnectomeGPT100M \
     --tokenizer_name gpt2 \
     --device cuda:0
   ```

2. **Run HellaSwag**:
   ```bash
   python evals/hellaswag.py \
     --model_name TuKoResearch/ConnectomeGPT100M \
     --tokenizer_name gpt2 \
     --device cuda:0
   ```
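Both benchmarks are multiple-choice tasks, which are typically scored by comparing the model's log-likelihood of each candidate continuation. The sketch below illustrates length-normalized log-likelihood scoring with hypothetical per-token log-probabilities; the exact scoring used in `evals/hellaswag.py` and `evals/mmlu.py` may differ.

```python
import numpy as np

def pick_answer(continuation_logprobs):
    """Return the index of the candidate continuation with the highest
    average per-token log-probability (length-normalized scoring, as is
    common for HellaSwag-style multiple choice)."""
    scores = [float(np.mean(lp)) for lp in continuation_logprobs]
    return int(np.argmax(scores))

# Hypothetical per-token log-probs for four candidate endings
candidates = [
    np.array([-2.1, -3.0, -1.5]),
    np.array([-0.9, -1.2, -1.1]),   # highest average log-prob
    np.array([-2.5, -2.4]),
    np.array([-4.0, -3.3, -2.8, -3.1]),
]
print(pick_answer(candidates))  # → 1
```

Length normalization avoids systematically penalizing longer answer options, which matters for HellaSwag's free-form endings.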

---

## Behavioral alignment
We use the Futrell2018 reading time benchmark, which can be obtained from [brain-score language](https://github.com/brain-score/language) and can be loaded using an environment with `xarray` installed. The data can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/assy_Futrell2018.nc?download=true).

Once downloaded, place the Futrell2018 reading-time dataset (`assy_Futrell2018.nc`) in a directory called `data/`.

To run the surprisal evaluation script and compute the Pearson correlation between model surprisal and human reading times (for the final checkpoint), execute:

```bash
python surprisal_eval.py \
  --model_name TuKoResearch/ConnectomeGPT100M \
  --tokenizer_name gpt2 \
  --device cuda:0
```
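The core of this evaluation is a Pearson correlation between per-word model surprisal and human reading times. A minimal sketch with made-up values (the real inputs come from the model and the Futrell2018 dataset):

```python
import numpy as np

# Hypothetical per-word surprisals (nats) and human reading times (ms)
surprisal  = np.array([2.3, 5.1, 1.8, 6.4, 3.0])
reading_ms = np.array([310., 420., 295., 455., 340.])

# Pearson correlation between surprisal and reading time
r = np.corrcoef(surprisal, reading_ms)[0, 1]
print(r)
```

Higher surprisal predicting longer reading times yields a positive correlation, which is the behavioral-alignment signal reported for the final checkpoint.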


---

## Neural alignment
We use the Tuckute2024 neural benchmark, which can be downloaded from the following [public repository](https://github.com/gretatuckute/drive_suppress_brains) or [brain-score language](https://github.com/brain-score/language). The cross-validation neural predictivity score can be run from [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py) and looped across layers and models using [NeuralAlignment/loop_fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/loop_fit_mapping.py).

In some of the analyses, we first localize the LLM language units, per the approach established in AlKhamissi et al., 2025 (_ACL_), from the [following repository](https://github.com/BKHMSI/llm-localization). We adapted this code to output a binary mask which marks the LLM language units as 1. The [NeuralAlignment/apply_langloc_mask.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/apply_langloc_mask.py) script takes the numpy binary mask for a given model and saves the masked embedding values as a CSV file, which can then serve as the input to [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py).
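The masking step can be pictured as selecting the language-unit columns of an embedding matrix and writing them out for the mapping script. This is a sketch with hypothetical shapes; `apply_langloc_mask.py` may instead zero out non-language units or use a different file layout.

```python
import numpy as np

# Hypothetical (sentences x units) embeddings and a binary mask
# marking language-selective units with 1
embeddings = np.random.default_rng(0).normal(size=(4, 6))
lang_mask = np.array([1, 0, 1, 1, 0, 0], dtype=bool)

# Keep only the language-unit columns and write them to CSV,
# which the regression mapping can then consume
masked = embeddings[:, lang_mask]
np.savetxt("masked_embeddings.csv", masked, delimiter=",")
print(masked.shape)  # → (4, 3)
```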

The regression outputs can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/SHARE.zip?download=true).

---

## LLM Training

Once your environment is ready, train the Generational Pruning GPT model from a pruned checkpoint with:

```bash
# Single-GPU debug run
python train.py \
  --run_name my_experiment \
  --train_data_dir path/to/train/*.bin \
  --val_data_dir path/to/val/*.bin \
  --wandb            # (optional: log to Weights & Biases)

# Multi-GPU DDP run
torchrun --standalone --nproc_per_node=8 train.py \
  --run_name my_experiment \
  --train_data_dir path/to/train/*.bin \
  --val_data_dir path/to/val/*.bin \
  --per_device_batch_size 16 \
  --batch_size 512 \
  --wandb
```

**Key flags**:
- `--run_name`: name for output folder under `./out/` and (optionally) W&B run.  
- `--train_data_dir` / `--val_data_dir`: glob pattern for `.bin` tokenized data.  
- `--per_device_batch_size`: batch size per GPU.  
- `--batch_size`: total batch size (will be split across GPUs).  
- `--wandb`: enable logging to Weights & Biases.  
- `--push_to_hf`: after training, upload final model to Hugging Face Hub under repo name `--run_name`.
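Because `--batch_size` is the total batch split across GPUs, the batch flags together imply a number of gradient-accumulation steps. A quick sanity check, assuming `train.py` divides `--batch_size` evenly across devices:

```python
def accumulation_steps(total_batch, per_device_batch, num_gpus):
    """Gradient-accumulation steps implied by the batch-size flags
    (assumes the total divides evenly across devices)."""
    per_step = per_device_batch * num_gpus
    assert total_batch % per_step == 0, "batch_size must be divisible"
    return total_batch // per_step

# The multi-GPU example above: 512 total, 16 per device, 8 GPUs
print(accumulation_steps(512, 16, 8))  # → 4
```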

All other flags (learning rate, scheduler, pruning init, etc.) can be viewed with:

```bash
python train.py --help
```

To run the pruning training, run:

```bash
python train_itp.py \
  --run_name my_experiment \
  --train_data_dir path/to/train/*.bin \
  --val_data_dir path/to/val/*.bin \
  --wandb            # (optional: log to Weights & Biases)
```

This will save a checkpoint to `out/<my_experiment>`, which you can use as the connectome for the inner-loop training above.
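One simple way a pruned checkpoint can define a "connectome" is as a binary mask over the largest-magnitude weights. The sketch below shows magnitude-based mask construction; the repository's actual pruning schedule lives in `train_itp.py` and may differ.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Binary mask keeping the largest-magnitude (1 - sparsity) fraction
    of weights; ties at the threshold may keep a few extra entries."""
    k = int(weights.size * (1.0 - sparsity))
    thresh = np.sort(np.abs(weights).ravel())[::-1][k - 1]
    return (np.abs(weights) >= thresh).astype(np.float32)

w = np.array([[0.5, -0.1],
              [0.05, -0.9]])
mask = magnitude_mask(w, sparsity=0.5)  # keep the 2 largest of 4 weights
print(mask)  # → [[1. 0.] [0. 1.]]
```

Such a mask can then gate the weights of a freshly initialized model in the inner training loop, so only the "connectome" connections are trained.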

---

## Citation

If you use this code, please cite:

> Kotar, K., & Tuckute, G. (2025). Model connectomes: A generational approach to data-efficient language models. *Second Workshop on Representational Alignment at ICLR 2025*.