# Model connectomes: A generational approach to data-efficient language models
_Second Workshop on Representational Alignment at ICLR 2025_
**By:** Klemen Kotar & Greta Tuckute

---
## Released Models
We have released the following pretrained Generational Connectome GPT models on the Hugging Face Hub:
| Model | Description |
|-------|-------------|
| [TuKoResearch/ConnectomeGPT100M](https://huggingface.co/TuKoResearch/ConnectomeGPT100M/) | Generational Pruning GPT with learned connectome |
| [TuKoResearch/RandomConnectomeGPT100M](https://huggingface.co/TuKoResearch/RandomConnectomeGPT100M/) | Generational Pruning GPT with random connectome |
| [TuKoResearch/NoConnectomeGPT100M](https://huggingface.co/TuKoResearch/NoConnectomeGPT100M/) | Generational Pruning GPT without any connectome |
You can evaluate any of these models on downstream NLP benchmarks by specifying the `--model_name` flag in the evaluation scripts.
---
## Installation
1. **Clone the repo**
```bash
git clone https://github.com/TuKoResearch/GenerationalConnectomes.git
cd GenerationalConnectomes
```
2. **Create & activate a Conda environment**
```bash
conda create -n GenerationalConnectomes python=3.11 -y
conda activate GenerationalConnectomes
```
3. **Install PyTorch 2.6** (with the appropriate CUDA toolkit for your setup)
```bash
# pick the wheel index matching your CUDA setup (e.g. cu118, cu124, cu126, or cpu)
pip install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```
4. **Install the remaining dependencies**
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
---
## NLP Evaluations
We provide evaluation scripts for MMLU and HellaSwag in `evals/`.
You can reproduce our evaluations using the model checkpoints from the Hugging Face Hub:
1. **Run MMLU**:
```bash
python evals/mmlu.py \
--model_name TuKoResearch/ConnectomeGPT100M \
--tokenizer_name gpt2 \
--device cuda:0
```
2. **Run HellaSwag**:
```bash
python evals/hellaswag.py \
--model_name TuKoResearch/ConnectomeGPT100M \
--tokenizer_name gpt2 \
--device cuda:0
```
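Both benchmarks are multiple-choice; a common scoring rule for such evaluations (and an assumption here, not a description of the repo's scripts) is to pick the candidate continuation with the highest length-normalized log-probability under the model. A toy sketch:

```python
# Toy sketch of length-normalized multiple-choice scoring, as used by
# benchmarks like HellaSwag. The inputs below are illustrative values,
# not outputs of evals/hellaswag.py.
def pick_answer(logprobs, lengths):
    """Return the index of the candidate with the best per-token log-prob."""
    normed = [lp / n for lp, n in zip(logprobs, lengths)]
    return max(range(len(normed)), key=normed.__getitem__)

# Candidate 1 wins on per-token score: -5.0 / 5 = -1.0
best = pick_answer([-10.0, -5.0, -12.0], [5, 5, 4])  # → 1
```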
---
## Behavioral alignment
We use the Futrell2018 reading-time benchmark, which is available via [brain-score language](https://github.com/brain-score/language) and can be loaded in any environment with `xarray` installed. The data can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/assy_Futrell2018.nc?download=true).
Once downloaded, place the Futrell2018 reading-time dataset (`assy_Futrell2018.nc`) in a directory called `data/`.
To run the surprisal evaluation script and compute the Pearson correlation between model surprisal and human reading times (for the final checkpoint), execute:
```bash
python surprisal_eval.py \
--model_name TuKoResearch/ConnectomeGPT100M \
--tokenizer_name gpt2 \
--device cuda:0
```
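The reported metric is a Pearson correlation between per-word model surprisal and per-word human reading times. A minimal, self-contained sketch of that computation (variable names are illustrative, not taken from `surprisal_eval.py`):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

surprisal = [2.1, 5.3, 1.2, 7.8, 3.3]             # -log p(word) under the LM
reading_ms = [210.0, 340.0, 190.0, 420.0, 260.0]  # human per-word reading times
r = pearson_r(surprisal, reading_ms)              # higher r = better alignment
```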
---
## Neural alignment
We use the Tuckute2024 neural benchmark, which can be downloaded from the following [public repository](https://github.com/gretatuckute/drive_suppress_brains) or [brain-score language](https://github.com/brain-score/language). The cross-validation neural predictivity score can be run from [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py) and looped across layers and models using [NeuralAlignment/loop_fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/loop_fit_mapping.py).
In some of the analyses, we first localize the LLM language units, per the approach established in AlKhamissi et al., 2025 (_ACL_), from the [following repository](https://github.com/BKHMSI/llm-localization). We adapted this code to output a binary mask that marks the LLM language units as 1. The [NeuralAlignment/apply_langloc_mask.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/apply_langloc_mask.py) script takes the numpy binary mask for a given model and saves the masked embedding values as a CSV file, which can then serve as the input to [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py).
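As a minimal sketch of that masking step (array names and shapes are illustrative assumptions, not the actual interface of `apply_langloc_mask.py`):

```python
import numpy as np

# Binary language-localizer mask: 1 marks an LLM "language unit".
mask = np.array([1, 0, 1, 1, 0], dtype=bool)

# Illustrative layer embeddings with shape (n_sentences, n_units).
embeddings = np.arange(15, dtype=float).reshape(3, 5)

# Keep only the language-unit columns; a matrix like this is what gets
# saved as a CSV and fed to fit_mapping.py.
masked = embeddings[:, mask]
# np.savetxt("masked_embeddings.csv", masked, delimiter=",")
```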
The regression outputs can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/SHARE.zip?download=true).
---
## LLM Training
Once your environment is ready, train the Generational Pruning GPT model from a pruned checkpoint with:
```bash
# Single-GPU debug run
python train.py \
--run_name my_experiment \
--train_data_dir path/to/train/*.bin \
--val_data_dir path/to/val/*.bin \
--wandb # (optional: log to Weights & Biases)
# Multi-GPU DDP run
torchrun --standalone --nproc_per_node=8 train.py \
--run_name my_experiment \
--train_data_dir path/to/train/*.bin \
--val_data_dir path/to/val/*.bin \
--per_device_batch_size 16 \
--batch_size 512 \
--wandb
```
**Key flags**:
- `--run_name`: name for output folder under `./out/` and (optionally) W&B run.
- `--train_data_dir` / `--val_data_dir`: glob pattern for `.bin` tokenized data.
- `--per_device_batch_size`: batch size per GPU.
- `--batch_size`: total batch size (will be split across GPUs).
- `--wandb`: enable logging to Weights & Biases.
- `--push_to_hf`: after training, upload final model to Hugging Face Hub under repo name `--run_name`.
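Assuming `train.py` follows the standard DDP recipe (an assumption; check `python train.py --help` for the actual semantics), `--batch_size` is reconciled with `--per_device_batch_size` via gradient accumulation, and the numbers in the multi-GPU example above work out as:

```python
# Assumed accumulation arithmetic for the DDP example above.
batch_size = 512             # --batch_size (total, split across GPUs)
per_device_batch_size = 16   # --per_device_batch_size
world_size = 8               # --nproc_per_node

# Each optimizer step accumulates gradients over this many micro-batches:
grad_accum_steps = batch_size // (per_device_batch_size * world_size)  # → 4

# Sanity check: the total batch must divide evenly across devices.
assert batch_size % (per_device_batch_size * world_size) == 0
```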
All other flags (learning rate, scheduler, pruning init, etc.) can be viewed with:
```bash
python train.py --help
```
To run the pruning training, execute:
```bash
python train_itp.py \
--run_name my_experiment \
--train_data_dir path/to/train/*.bin \
--val_data_dir path/to/val/*.bin \
--wandb # (optional: log to Weights & Biases)
```
This will save a checkpoint to `out/<my_experiment>`, which you can use as your connectome for the inner-loop training above.
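Conceptually, the pruning run distills the trained weights into a binary "connectome" mask. A toy sketch of one common criterion, magnitude pruning (an illustrative assumption; see `train_itp.py` for the actual procedure):

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Binary mask keeping the top (1 - sparsity) fraction of weights
    by absolute value; pruned positions are 0."""
    k = int(round(weights.size * (1.0 - sparsity)))
    threshold = np.sort(np.abs(weights).ravel())[::-1][k - 1]
    return (np.abs(weights) >= threshold).astype(np.uint8)

w = np.array([[0.5, -0.1], [0.05, -0.8]])
mask = magnitude_mask(w, sparsity=0.5)  # keeps the two largest magnitudes
```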
---
## Citation
If you use this code, please cite:
> Kotar, K., & Tuckute, G. (2025). Model connectomes: A generational approach to data-efficient language models. *Second Workshop on Representational Alignment at ICLR 2025*.