klemenk committed e8be55c (verified) · Parent(s): 8e19de1

Create README.md

Files changed (1): README.md (+156, -0)
# Model connectomes: A generational approach to data-efficient language models
_Second Workshop on Representational Alignment at ICLR 2025_

**By:** Klemen Kotar & Greta Tuckute

---

![Paper Figure](generational_connectome_fig.png)

---

## Released Models

We have released the following pretrained Generational Connectome GPT models on the Hugging Face Hub:

| Model | Description |
|-------|-------------|
| [TuKoResearch/ConnectomeGPT100M](https://huggingface.co/TuKoResearch/ConnectomeGPT100M/) | Generational Pruning GPT with learned connectome |
| [TuKoResearch/RandomConnectomeGPT100M](https://huggingface.co/TuKoResearch/RandomConnectomeGPT100M/) | Generational Pruning GPT with random connectome |
| [TuKoResearch/NoConnectomeGPT100M](https://huggingface.co/TuKoResearch/NoConnectomeGPT100M/) | Generational Pruning GPT without any connectome |

You can evaluate any of these models on downstream NLP benchmarks by specifying the `--model_name` flag in the evaluation scripts.
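
For a quick sanity check outside the provided scripts, the checkpoints can also be loaded directly from the Hub. The snippet below is a minimal sketch and assumes the repositories expose a standard causal-LM interface through `transformers` (the custom architecture may require `trust_remote_code=True`); the prompt is just an example:

```python
# Minimal loading sketch -- assumes a standard causal-LM interface;
# trust_remote_code=True may be needed for the custom architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # the eval scripts use the gpt2 tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "TuKoResearch/ConnectomeGPT100M", trust_remote_code=True
)
model.eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, sequence, vocab)
print(logits.shape)
```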

---

## Installation

1. **Clone the repo**
   ```bash
   git clone https://github.com/TuKoResearch/GenerationalConnectomes.git
   cd GenerationalConnectomes
   ```

2. **Create & activate a Conda environment**
   ```bash
   conda create -n GenerationalConnectomes python=3.11 -y
   conda activate GenerationalConnectomes
   ```

3. **Install PyTorch 2.6** (choose the CUDA build matching your setup; see pytorch.org for the right index URL)
   ```bash
   pip install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
   ```

4. **Install the remaining dependencies**
   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

---

## NLP Evaluations

We provide evaluation scripts for MMLU and HellaSwag inside `evals/`.
You can reproduce our evaluations by running them against the model checkpoints from the Hugging Face Hub:

1. **Run MMLU**:
   ```bash
   python evals/mmlu.py \
       --model_name TuKoResearch/ConnectomeGPT100M \
       --tokenizer_name gpt2 \
       --device cuda:0
   ```

2. **Run HellaSwag**:
   ```bash
   python evals/hellaswag.py \
       --model_name TuKoResearch/ConnectomeGPT100M \
       --tokenizer_name gpt2 \
       --device cuda:0
   ```

---

## Behavioral alignment

We use the Futrell2018 reading-time benchmark, which can be obtained from [brain-score language](https://github.com/brain-score/language) and loaded in any environment with `xarray` installed. The data can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/assy_Futrell2018.nc?download=true).

Once downloaded, place the Futrell2018 reading-time dataset (`assy_Futrell2018.nc`) in a directory called `data/`.
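
To inspect the benchmark, the NetCDF file can be opened with `xarray`. This is a minimal sketch and assumes the file loads as a standard NetCDF dataset; the exact variable and coordinate names depend on how the brain-score assembly was stored, so inspect the printout rather than relying on any names used here:

```python
# Minimal sketch: open the Futrell2018 assembly and inspect its structure.
# Assumes data/assy_Futrell2018.nc is a standard NetCDF file readable by xarray.
import xarray as xr

ds = xr.open_dataset("data/assy_Futrell2018.nc")
print(ds)            # dimensions, coordinates, and data variables
print(ds.data_vars)  # e.g. per-word reading times, depending on the assembly layout
```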

To run the surprisal evaluation script and compute the Pearson correlation between model surprisal and human reading times (for the final checkpoint), execute:

```bash
python surprisal_eval.py \
    --model_name TuKoResearch/ConnectomeGPT100M \
    --tokenizer_name gpt2 \
    --device cuda:0
```
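
The reported score is, at its core, a Pearson correlation between per-word surprisal and per-word reading times. The toy snippet below only illustrates that metric; the arrays are made up and are not the script's internals:

```python
# Toy illustration of the metric: Pearson r between surprisal and reading times.
# The arrays here are placeholders; surprisal_eval.py computes the real values.
import numpy as np
from scipy.stats import pearsonr

surprisal = np.array([3.2, 7.8, 1.1, 5.6, 4.3])                # model surprisal per word
reading_time = np.array([210.0, 340.0, 180.0, 290.0, 250.0])   # human reading times (ms)

r, p = pearsonr(surprisal, reading_time)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```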

---

## Neural alignment

We use the Tuckute2024 neural benchmark, which can be downloaded from the following [public repository](https://github.com/gretatuckute/drive_suppress_brains) or from [brain-score language](https://github.com/brain-score/language). The cross-validated neural predictivity score can be computed with [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py) and looped across layers and models with [NeuralAlignment/loop_fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/loop_fit_mapping.py).
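
The mapping in `fit_mapping.py` fits a regression from model activations to neural responses and scores it with cross-validation. The sketch below is not that script; it is a minimal illustration of the general recipe (cross-validated ridge regression from an embedding matrix to neural responses), with made-up shapes and variable names:

```python
# Minimal sketch of cross-validated neural predictivity:
# ridge regression from model embeddings (n_sentences x n_units)
# to neural responses (n_sentences x n_voxels), scored by Pearson r per voxel.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 768))   # model embeddings (placeholder)
Y = rng.standard_normal((200, 50))    # neural responses (placeholder)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    reg = Ridge(alpha=1.0).fit(X[train_idx], Y[train_idx])
    pred = reg.predict(X[test_idx])
    # Pearson r per voxel between predicted and observed responses
    r = [np.corrcoef(pred[:, v], Y[test_idx][:, v])[0, 1] for v in range(Y.shape[1])]
    scores.append(np.mean(r))

print(f"Mean cross-validated predictivity: {np.mean(scores):.3f}")
```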

In some of the analyses, we first localize the LLM language units, per the approach established in AlKhamissi et al., 2025 (_ACL_), using code from the [following repository](https://github.com/BKHMSI/llm-localization). We adapted this code (POINTER??) to output a binary mask that marks the LLM language units as 1. The [NeuralAlignment/apply_langloc_mask.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/apply_langloc_mask.py) script takes the numpy binary mask for a given model and saves the masked embedding values as a CSV file, which can then serve as the input to [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py).
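
For reference, the masking step amounts to zeroing out the units not flagged by the localizer and writing the result to CSV. This is a hypothetical sketch of that operation, not the contents of `apply_langloc_mask.py`; the file names and array shapes are assumptions:

```python
# Hypothetical sketch: apply a binary language-localizer mask to unit activations
# and save the masked values as CSV (mirrors what apply_langloc_mask.py is described to do).
import numpy as np
import pandas as pd

mask = np.load("langloc_mask.npy")              # binary mask, shape (n_units,), 1 = language unit
activations = np.load("model_activations.npy")  # shape (n_sentences, n_units), assumed layout

masked = activations * mask                     # zero out non-language units
pd.DataFrame(masked).to_csv("masked_embeddings.csv", index=False)
```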

The regression outputs can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/SHARE.zip?download=true).

---

## LLM Training

Once your environment is ready, train the Generational Pruning GPT model from a pruned checkpoint with:

```bash
# Single-GPU debug run
python train.py \
    --run_name my_experiment \
    --train_data_dir "path/to/train/*.bin" \
    --val_data_dir "path/to/val/*.bin" \
    --wandb  # (optional: log to Weights & Biases)

# Multi-GPU DDP run
torchrun --standalone --nproc_per_node=8 train.py \
    --run_name my_experiment \
    --train_data_dir "path/to/train/*.bin" \
    --val_data_dir "path/to/val/*.bin" \
    --per_device_batch_size 16 \
    --batch_size 512 \
    --wandb
```

**Key flags**:
- `--run_name`: name for the output folder under `./out/` and (optionally) the W&B run.
- `--train_data_dir` / `--val_data_dir`: glob patterns for `.bin` tokenized data.
- `--per_device_batch_size`: batch size per GPU.
- `--batch_size`: total batch size (will be split across GPUs).
- `--wandb`: enable logging to Weights & Biases.
- `--push_to_hf`: after training, upload the final model to the Hugging Face Hub under the repo name given by `--run_name`.

All other flags (learning rate, scheduler, pruning init, etc.) can be viewed with:

```bash
python train.py --help
```

To run the pruning training (which produces the connectome checkpoint), run:

```bash
python train_itp.py \
    --run_name my_experiment \
    --train_data_dir "path/to/train/*.bin" \
    --val_data_dir "path/to/val/*.bin" \
    --wandb  # (optional: log to Weights & Biases)
```

This will save a checkpoint to `out/<my_experiment>`, which you can use as the connectome for the inner-loop training above.
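
Conceptually, the connectome is a sparsity pattern over the pruned model's weights. If you want to inspect such a checkpoint as a standalone binary mask, a hedged sketch is below; the checkpoint path, state-dict layout, and the assumption that pruned weights are stored as exact zeros are all hypothetical and may not match what `train_itp.py` actually saves:

```python
# Hypothetical sketch: derive a binary connectome mask (1 = kept weight) from a
# pruned checkpoint by treating exactly-zero weights as pruned. The file name and
# state-dict layout are assumptions, not necessarily what train_itp.py writes.
import torch

ckpt = torch.load("out/my_experiment/ckpt.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"

connectome = {
    name: (w != 0).to(torch.uint8)
    for name, w in state_dict.items()
    if torch.is_tensor(w) and w.dtype.is_floating_point
}
kept = sum(m.sum().item() for m in connectome.values())
total = sum(m.numel() for m in connectome.values())
print(f"Overall sparsity of the connectome: {1 - kept / total:.2%}")
```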

---

## Citation

If you use this code, please cite:

> Kotar, K., & Tuckute, G. (2025). Model connectomes: A generational approach to data-efficient language models. *Second Workshop on Representational Alignment at ICLR 2025*.