# Model connectomes: A generational approach to data-efficient language models

_Second Workshop on Representational Alignment at ICLR 2025_

**By:** Klemen Kotar & Greta Tuckute

---

![Paper Figure](generational_connectome_fig.png)

---

## Released Models

We have released the following pretrained Generational Connectome GPT models on the Hugging Face Hub:

| Model | Description |
|-------|-------------|
| [TuKoResearch/ConnectomeGPT100M](https://huggingface.co/TuKoResearch/ConnectomeGPT100M/) | Generational Pruning GPT with learned connectome |
| [TuKoResearch/RandomConnectomeGPT100M](https://huggingface.co/TuKoResearch/RandomConnectomeGPT100M/) | Generational Pruning GPT with random connectome |
| [TuKoResearch/NoConnectomeGPT100M](https://huggingface.co/TuKoResearch/NoConnectomeGPT100M/) | Generational Pruning GPT without any connectome |

You can evaluate any of these models on downstream NLP benchmarks by passing its name via the `--model_name` flag in the evaluation scripts.

---

## Installation

1. **Clone the repo**

   ```bash
   git clone https://github.com/TuKoResearch/GenerationalConnectomes.git
   cd GenerationalConnectomes
   ```

2. **Create & activate a Conda environment**

   ```bash
   conda create -n GenerationalConnectomes python=3.11 -y
   conda activate GenerationalConnectomes
   ```

3. **Install PyTorch 2.6** (choose the build matching your CUDA setup; for example, for CUDA 12.4)

   ```bash
   pip install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
   ```

4. **Install the remaining dependencies**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

---

## NLP Evaluations

We provide evaluation scripts for MMLU and HellaSwag in `evals/`. You can reproduce our evaluations with the model checkpoints from the Hugging Face Hub as follows:

1. **Run MMLU**:

   ```bash
   python evals/mmlu.py \
     --model_name TuKoResearch/ConnectomeGPT100M \
     --tokenizer_name gpt2 \
     --device cuda:0
   ```

2. **Run HellaSwag**:

   ```bash
   python evals/hellaswag.py \
     --model_name TuKoResearch/ConnectomeGPT100M \
     --tokenizer_name gpt2 \
     --device cuda:0
   ```

---

## Behavioral alignment

We use the Futrell2018 reading-time benchmark, which is available through [brain-score language](https://github.com/brain-score/language) and can be loaded in any environment with `xarray` installed. The data can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/assy_Futrell2018.nc?download=true). Once downloaded, place the Futrell2018 reading-time dataset (`assy_Futrell2018.nc`) in a directory called `data/`.

To run the surprisal evaluation script and compute the Pearson correlation between model surprisal and human reading times (for the final checkpoint), execute:

```bash
python surprisal_eval.py \
  --model_name TuKoResearch/ConnectomeGPT100M \
  --tokenizer_name gpt2 \
  --device cuda:0
```

---

## Neural alignment

We use the Tuckute2024 neural benchmark, which can be downloaded from this [public repository](https://github.com/gretatuckute/drive_suppress_brains) or from [brain-score language](https://github.com/brain-score/language). The cross-validated neural predictivity score can be computed with [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py) and looped across layers and models with [NeuralAlignment/loop_fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/loop_fit_mapping.py).
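For orientation, below is a minimal, self-contained sketch of the kind of cross-validated mapping this script computes: a ridge regression from model embeddings to brain responses, scored by the Pearson correlation between held-out predictions and observations. The variable names, shapes, and regression settings here are illustrative assumptions, not the repository's actual configuration; see `fit_mapping.py` for the canonical implementation.

```python
# Illustrative cross-validated neural predictivity sketch (not the repo's code).
# X: model embeddings (n_sentences x n_features); Y: brain responses
# (n_sentences x n_voxels). Random data stands in for the real benchmark.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 768))   # stand-in for one layer's embeddings
Y = rng.standard_normal((200, 50))    # stand-in for voxel responses

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit a ridge mapping on the training folds, with the penalty chosen by
    # internal cross-validation over a small alpha grid.
    mapping = RidgeCV(alphas=np.logspace(-3, 3, 7))
    mapping.fit(X[train_idx], Y[train_idx])
    Y_pred = mapping.predict(X[test_idx])
    # Pearson r per voxel between predicted and observed held-out responses,
    # averaged over voxels for this fold.
    fold_r = np.mean([pearsonr(Y_pred[:, v], Y[test_idx, v])[0]
                      for v in range(Y.shape[1])])
    scores.append(fold_r)

print(f"mean cross-validated predictivity: {np.mean(scores):.3f}")
```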
In some of the analyses, we first localize the LLM language units, following the approach established in AlKhamissi et al., 2025 (_ACL_), using code from [this repository](https://github.com/BKHMSI/llm-localization). We adapted this code to output a binary mask that marks the LLM language units as 1. The [NeuralAlignment/apply_langloc_mask.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/apply_langloc_mask.py) script takes the NumPy binary mask for a given model and saves the masked embedding values as a CSV file, which can then serve as the input to [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py). The regression outputs can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/SHARE.zip?download=true).

---

## LLM Training

Once your environment is ready, train the Generational Pruning GPT model from a pruned checkpoint with:

```bash
# Single-GPU debug run
python train.py \
  --run_name my_experiment \
  --train_data_dir path/to/train/*.bin \
  --val_data_dir path/to/val/*.bin \
  --wandb # (optional: log to Weights & Biases)

# Multi-GPU DDP run
torchrun --standalone --nproc_per_node=8 train.py \
  --run_name my_experiment \
  --train_data_dir path/to/train/*.bin \
  --val_data_dir path/to/val/*.bin \
  --per_device_batch_size 16 \
  --batch_size 512 \
  --wandb
```

**Key flags**:

- `--run_name`: name for the output folder under `./out/` and (optionally) the W&B run.
- `--train_data_dir` / `--val_data_dir`: glob patterns for `.bin` tokenized data.
- `--per_device_batch_size`: batch size per GPU.
- `--batch_size`: total batch size (split across GPUs).
- `--wandb`: enable logging to Weights & Biases.
- `--push_to_hf`: after training, upload the final model to the Hugging Face Hub under the repo name given by `--run_name`.

All other flags (learning rate, scheduler, pruning init, etc.) can be viewed with:

```bash
python train.py --help
```

To run the pruning training that produces a connectome, run:

```bash
python train_itp.py \
  --run_name my_experiment \
  --train_data_dir path/to/train/*.bin \
  --val_data_dir path/to/val/*.bin \
  --wandb # (optional: log to Weights & Biases)
```

This saves a checkpoint to `out/`, which you can use as the connectome for the inner-loop training above.

---

## Citation

If you use this code, please cite:

> Kotar, K., & Tuckute, G. (2025). Model connectomes: A generational approach to data-efficient language models. *Second Workshop on Representational Alignment at ICLR 2025*.
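---

As a supplementary illustration, the sketch below shows one way to compute per-token surprisal, the quantity correlated with human reading times in the behavioral alignment evaluation above, from a released checkpoint. It assumes the Hub repositories load through `transformers`' `AutoModelForCausalLM` with `trust_remote_code=True` and return standard `.logits`; `surprisal_eval.py` remains the canonical implementation.

```python
# Illustrative surprisal computation with a released checkpoint (assumes the
# model loads via AutoModelForCausalLM with trust_remote_code=True).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "TuKoResearch/ConnectomeGPT100M", trust_remote_code=True
)
model.eval()

text = "The horse raced past the barn fell."
ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits  # (1, seq_len, vocab_size)

# Surprisal of token t is -log p(token_t | tokens_<t), so we score each
# token (after the first) under the distribution predicted at the previous position.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
surprisal = -log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
for tok, s in zip(tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist()), surprisal[0]):
    print(f"{tok:>12s}  {s.item():6.2f} nats")
```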