# Model connectomes: A generational approach to data-efficient language models
_Second Workshop on Representational Alignment at ICLR 2025_
**By:** Klemen Kotar & Greta Tuckute

---
## Released Models
We have released the following pretrained Generational Connectome GPT models on the Hugging Face Hub:

| Model | Description |
|-------|-------------|
| [TuKoResearch/ConnectomeGPT100M](https://huggingface.co/TuKoResearch/ConnectomeGPT100M/) | Generational Pruning GPT with learned connectome |
| [TuKoResearch/RandomConnectomeGPT100M](https://huggingface.co/TuKoResearch/RandomConnectomeGPT100M/) | Generational Pruning GPT with random connectome |
| [TuKoResearch/NoConnectomeGPT100M](https://huggingface.co/TuKoResearch/NoConnectomeGPT100M/) | Generational Pruning GPT without any connectome |

You can evaluate any of these models on downstream NLP benchmarks by specifying the `--model_name` flag in the evaluation scripts.
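If you just want to inspect a checkpoint outside the evaluation scripts, the models can also be loaded directly from the Hub. This is a minimal sketch assuming the checkpoints work with the standard `transformers` Auto classes (custom model code may require `trust_remote_code=True`); the exact loading path used by our scripts is in `evals/`:
```python
# Minimal loading sketch (assumes compatibility with the transformers Auto classes;
# custom architectures may need trust_remote_code=True).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # the eval scripts use the gpt2 tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "TuKoResearch/ConnectomeGPT100M", trust_remote_code=True
).eval()

# Next-token logits for a short prompt
inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (batch, sequence length, vocab size)
```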
---
## Installation
1. **Clone the repo**
```bash
git clone https://github.com/TuKoResearch/GenerationalConnectomes.git
cd GenerationalConnectomes
```
2. **Create & activate a Conda environment**
```bash
conda create -n GenerationalConnectomes python=3.11 -y
conda activate GenerationalConnectomes
```
3. **Install PyTorch 2.6** (with the CUDA build that matches your setup)
```bash
# See https://pytorch.org/get-started/locally/ for the index URL matching your CUDA version
pip install torch==2.6.0 torchvision torchaudio
```
4. **Install the remaining dependencies**
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
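As an optional sanity check (not part of the documented setup), you can confirm that the installed PyTorch build sees your GPU before launching training or evaluation:
```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```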
---
## NLP Evaluations
We provide evaluation scripts for MMLU and HellaSwag in `evals/`.
You can reproduce our evaluations by running the following commands with the model checkpoints from the Hugging Face Hub:
1. **Run MMLU**:
```bash
python evals/mmlu.py \
    --model_name TuKoResearch/ConnectomeGPT100M \
    --tokenizer_name gpt2 \
    --device cuda:0
```
2. **Run HellaSwag**:
```bash
python evals/hellaswag.py \
    --model_name TuKoResearch/ConnectomeGPT100M \
    --tokenizer_name gpt2 \
    --device cuda:0
```
---
## Behavioral alignment
We use the Futrell2018 reading time benchmark, which can be obtained from [brain-score language](https://github.com/brain-score/language) and can be loaded in an environment with `xarray` installed. The data can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/assy_Futrell2018.nc?download=true).
Once downloaded, place the Futrell2018 reading-time dataset (`assy_Futrell2018.nc`) in a directory called `data/`.
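To verify that the file is readable before running the evaluation, you can open it with `xarray`. This is just a quick inspection sketch; the variable and coordinate names inside the NetCDF file are whatever the benchmark ships with:
```python
# Quick inspection of the Futrell2018 NetCDF assembly with xarray.
# Only checks that the file opens and prints its structure.
import xarray as xr

assembly = xr.open_dataset("data/assy_Futrell2018.nc")
print(assembly)  # lists dimensions, coordinates, and data variables
```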
To run the surprisal evaluation script and compute the Pearson correlation between model surprisal and human reading times (for the final checkpoint), execute:
```bash
python surprisal_eval.py \
    --model_name TuKoResearch/ConnectomeGPT100M \
    --tokenizer_name gpt2 \
    --device cuda:0
```
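Conceptually, the score reported by this evaluation is a Pearson correlation between per-word model surprisal and per-word human reading times. The sketch below shows only that final step, assuming you already have the two aligned arrays; it is not the repo's implementation, and the variable names and values are illustrative:
```python
# Conceptual sketch: correlate per-word surprisal with reading times.
# `surprisal` and `reading_times` are hypothetical, already-aligned 1D arrays.
import numpy as np
from scipy.stats import pearsonr

surprisal = np.array([3.2, 7.9, 1.4, 5.6])               # -log p(word | context)
reading_times = np.array([310.0, 420.0, 250.0, 380.0])   # per-word reading times (ms)

r, p_value = pearsonr(surprisal, reading_times)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```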
---
## Neural alignment
We use the Tuckute2024 neural benchmark, which can be downloaded from the following [public repository](https://github.com/gretatuckute/drive_suppress_brains) or from [brain-score language](https://github.com/brain-score/language). The cross-validated neural predictivity score can be computed with [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py) and looped across layers and models using [NeuralAlignment/loop_fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/loop_fit_mapping.py).
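For orientation, cross-validated neural predictivity is typically computed by fitting a regularized linear mapping from model activations to neural responses and scoring held-out predictions with a Pearson correlation. The exact procedure lives in `fit_mapping.py`; the sketch below is only a generic illustration of that idea, with hypothetical arrays in place of real activations and responses:
```python
# Generic cross-validated predictivity sketch (not the repo's fit_mapping.py).
# X: (n_sentences, n_model_units) activations; Y: (n_sentences, n_voxels) responses.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))   # hypothetical model activations
Y = rng.normal(size=(200, 10))   # hypothetical neural responses

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    mapping = Ridge(alpha=1.0).fit(X[train_idx], Y[train_idx])
    Y_pred = mapping.predict(X[test_idx])
    # Pearson r per voxel on held-out sentences, averaged across voxels
    scores.append(np.mean([pearsonr(Y[test_idx, v], Y_pred[:, v])[0] for v in range(Y.shape[1])]))

print("cross-validated predictivity:", np.mean(scores))
```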
In some of the analyses, we first localize the LLM language units, per the approach established in AlKhamissi et al., 2025 (_ACL_), from the [following repository](https://github.com/BKHMSI/llm-localization). We adapted this code to output a binary mask which marks the LLM language units as 1. The [NeuralAlignment/apply_langloc_mask.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/apply_langloc_mask.py) script takes the numpy binary mask for a given model and saves the masked embedding values as a CSV file, which can then serve as the input to [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py).
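The masking step itself amounts to keeping only the embedding columns (units) flagged as 1 in the binary mask and writing them out for the mapping script. The sketch below illustrates that operation with hypothetical file names; the actual I/O conventions are defined in `apply_langloc_mask.py`:
```python
# Illustration of applying a binary language-localizer mask to model embeddings.
# File names are hypothetical; see apply_langloc_mask.py for the real conventions.
import numpy as np
import pandas as pd

embeddings = np.load("embeddings.npy")        # (n_sentences, n_units) model activations
langloc_mask = np.load("langloc_mask.npy")    # (n_units,) binary mask, 1 = language unit

masked = embeddings[:, langloc_mask.astype(bool)]  # keep only language-selective units
pd.DataFrame(masked).to_csv("masked_embeddings.csv", index=False)
```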
The regression outputs can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/SHARE.zip?download=true).
---
## LLM Training
Once your environment is ready, train the Generational Pruning GPT model from a pruned checkpoint with:
```bash
# Single-GPU debug run
python train.py \
    --run_name my_experiment \
    --train_data_dir path/to/train/*.bin \
    --val_data_dir path/to/val/*.bin \
    --wandb # (optional: log to Weights & Biases)

# Multi-GPU DDP run
torchrun --standalone --nproc_per_node=8 train.py \
    --run_name my_experiment \
    --train_data_dir path/to/train/*.bin \
    --val_data_dir path/to/val/*.bin \
    --per_device_batch_size 16 \
    --batch_size 512 \
    --wandb
```
**Key flags**:
- `--run_name`: name for the output folder under `./out/` and (optionally) the W&B run.
- `--train_data_dir` / `--val_data_dir`: glob pattern for `.bin` tokenized data.
- `--per_device_batch_size`: batch size per GPU.
- `--batch_size`: total batch size (will be split across GPUs; see the note after this list).
- `--wandb`: enable logging to Weights & Biases.
- `--push_to_hf`: after training, upload the final model to the Hugging Face Hub under the repo name given by `--run_name`.
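As a worked example of how the two batch-size flags relate, assuming the usual setup in which gradient accumulation makes up the difference between the per-device and total batch sizes (confirm the exact behavior with `python train.py --help`):
```python
# Hypothetical illustration of the batch-size arithmetic (assumes gradient
# accumulation is used to reach the total batch size).
world_size = 8                 # GPUs launched by torchrun
per_device_batch_size = 16     # --per_device_batch_size
batch_size = 512               # --batch_size (total)

grad_accum_steps = batch_size // (world_size * per_device_batch_size)
print(grad_accum_steps)  # -> 4 micro-steps per GPU per optimizer step
```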
All other flags (learning rate, scheduler, pruning init, etc.) can be viewed with:
```bash
python train.py --help
```
To run the pruning training, use:
```bash
python train_itp.py \
    --run_name my_experiment \
    --train_data_dir path/to/train/*.bin \
    --val_data_dir path/to/val/*.bin \
    --wandb # (optional: log to Weights & Biases)
```
This will save a checkpoint to `out/<my_experiment>`, which you can use as your connectome for the inner-loop training above.
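Putting the two stages together, the intended workflow is: run `train_itp.py` to produce the pruned checkpoint (the connectome), then point `train.py` at it for the inner-loop run. The flag used below to pass the connectome checkpoint is a placeholder; check `python train.py --help` (the pruning-init option mentioned above) for the actual name:
```bash
# Stage 1: outer-loop pruning run produces the connectome checkpoint
python train_itp.py --run_name connectome_run \
    --train_data_dir path/to/train/*.bin --val_data_dir path/to/val/*.bin

# Stage 2: inner-loop training initialized from that checkpoint.
# NOTE: --pruning_init is a placeholder flag name; look up the real one with --help.
python train.py --run_name inner_loop_run \
    --train_data_dir path/to/train/*.bin --val_data_dir path/to/val/*.bin \
    --pruning_init out/connectome_run
```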
---
## Citation
If you use this code, please cite:

> Kotar, K., & Tuckute, G. (2025). Model connectomes: A generational approach to data-efficient language models. *Second Workshop on Representational Alignment at ICLR 2025*.