# Model connectomes: A generational approach to data-efficient language models
_Second Workshop on Representational Alignment at ICLR 2025_
**By:** Klemen Kotar & Greta Tuckute
---
![Paper Figure](generational_connectome_fig.png)
---
## Released Models
We have released the following pretrained Generational Connectome GPT models on the Hugging Face Hub:
| Model | Description |
|-------|-------------|
| [TuKoResearch/ConnectomeGPT100M](https://huggingface.co/TuKoResearch/ConnectomeGPT100M/) | Generational Pruning GPT with learned connectome |
| [TuKoResearch/RandomConnectomeGPT100M](https://huggingface.co/TuKoResearch/RandomConnectomeGPT100M/) | Generational Pruning GPT with random connectome |
| [TuKoResearch/NoConnectomeGPT100M](https://huggingface.co/TuKoResearch/NoConnectomeGPT100M/) | Generational Pruning GPT without any connectome |
You can evaluate any of these models on downstream NLP benchmarks by specifying the `--model_name` flag in the evaluation scripts.
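For a quick smoke test, the checkpoints can also be loaded directly from the Hub. The sketch below is illustrative rather than one of our scripts; it assumes the checkpoints load through `transformers` with `trust_remote_code=True` and pair with the standard GPT-2 tokenizer, as in the evaluation commands further down:
```python
# Minimal sketch: load a released checkpoint and inspect its next-token prediction.
# Assumes the repo ships custom model code (hence trust_remote_code=True) and that
# the forward pass returns an object exposing a .logits field.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "TuKoResearch/ConnectomeGPT100M", trust_remote_code=True
)
model.eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits
print(tokenizer.decode(logits[0, -1].argmax()))  # most likely next token
```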
---
## Installation
1. **Clone the repo**
```bash
git clone https://github.com/TuKoResearch/GenerationalConnectomes.git
cd GenerationalConnectomes
```
2. **Create & activate a Conda environment**
```bash
conda create -n GenerationalConnectomes python=3.11 -y
conda activate GenerationalConnectomes
```
3. **Install PyTorch 2.6** (with the CUDA build appropriate for your setup; see the [PyTorch install guide](https://pytorch.org/get-started/locally/) for the exact command)
```bash
pip install torch==2.6.0 torchvision torchaudio
```
4. **Install the remaining dependencies**
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
---
## NLP Evaluations
We provide evaluation scripts for MMLU and HellaSwag inside `evals/`.
You can reproduce our evaluations with the model checkpoints from the Hugging Face Hub (a sketch of the underlying scoring recipe follows the commands):
1. **Run MMLU**:
```bash
python evals/mmlu.py \
--model_name TuKoResearch/ConnectomeGPT100M \
--tokenizer_name gpt2 \
--device cuda:0
```
2. **Run HellaSwag**:
```bash
python evals/hellaswag.py \
--model_name TuKoResearch/ConnectomeGPT100M \
--tokenizer_name gpt2 \
--device cuda:0
```
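Both scripts follow the standard likelihood-ranking recipe for multiple-choice benchmarks. The sketch below illustrates that recipe only; it is not the repo's implementation, and it assumes a `transformers`-style model whose output exposes `.logits`:
```python
# Illustrative sketch of likelihood-ranked multiple-choice scoring (not the
# repo's implementation): pick the answer whose tokens the model assigns the
# highest average log-probability, given the question context.
import torch
import torch.nn.functional as F

def score_choice(model, tokenizer, context: str, choice: str) -> float:
    """Average log-probability of `choice` tokens, conditioned on `context`."""
    n_ctx = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # predicts tokens 1..T-1
    targets = full_ids[0, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[n_ctx - 1:].mean().item()          # choice tokens only

def predict(model, tokenizer, context: str, choices: list[str]) -> int:
    scores = [score_choice(model, tokenizer, context, c) for c in choices]
    return scores.index(max(scores))
```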
---
## Behavioral alignment
We use the Futrell2018 reading-time benchmark, which can be obtained from [brain-score language](https://github.com/brain-score/language) and loaded in any environment with `xarray` installed. The data can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/assy_Futrell2018.nc?download=true).
Once downloaded, place the dataset (`assy_Futrell2018.nc`) in a directory called `data/`.
To run the surprisal evaluation script and compute the Pearson correlation between model surprisal and human reading times (for the final checkpoint), execute:
```bash
python surprisal_eval.py \
--model_name TuKoResearch/ConnectomeGPT100M \
--tokenizer_name gpt2 \
--device cuda:0
```
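For intuition, the evaluation reduces to summing token surprisals within each word and correlating the per-word values with reading times. A minimal sketch of that computation follows; it assumes the `.nc` file opens as an xarray object, and the variable names are illustrative, not the dataset's actual schema:
```python
# Illustrative sketch of the surprisal/reading-time comparison (coordinate and
# variable names are hypothetical; inspect the xarray object for the real schema).
import numpy as np
import torch
import torch.nn.functional as F
import xarray as xr
from scipy.stats import pearsonr

assembly = xr.open_dataarray("data/assy_Futrell2018.nc")  # per-word reading times

def word_surprisals(model, tokenizer, words: list[str]) -> np.ndarray:
    """Sum token surprisal (in bits) within each word of a running text."""
    ids, word_of_token = [], []
    for w_idx, w in enumerate(words):
        toks = tokenizer(" " + w).input_ids
        ids += toks
        word_of_token += [w_idx] * len(toks)
    x = torch.tensor([ids])
    with torch.no_grad():
        logits = model(x).logits
    lp = F.log_softmax(logits[0, :-1], dim=-1)
    tok_surp = -lp.gather(-1, x[0, 1:, None]).squeeze(-1) / np.log(2.0)
    out = np.zeros(len(words))
    for s, w_idx in zip(tok_surp.tolist(), word_of_token[1:]):
        out[w_idx] += s
    return out

# With `words` and `reading_times` extracted from `assembly`:
# r, _ = pearsonr(word_surprisals(model, tokenizer, words), reading_times)
```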
---
## Neural alignment
We use the Tuckute2024 neural benchmark, which can be downloaded from this [public repository](https://github.com/gretatuckute/drive_suppress_brains) or from [brain-score language](https://github.com/brain-score/language). The cross-validated neural predictivity score can be computed with [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py) and looped across layers and models using [NeuralAlignment/loop_fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/loop_fit_mapping.py).
In some of the analyses, we first localize the LLM language units, following the approach established in AlKhamissi et al., 2025 (_ACL_), using code from the [following repository](https://github.com/BKHMSI/llm-localization). We adapted this code to output a binary mask that marks the LLM language units as 1. The [NeuralAlignment/apply_langloc_mask.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/apply_langloc_mask.py) script takes the numpy binary mask for a given model and saves the masked embedding values as a CSV file, which can then serve as the input to [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py).
The regression outputs can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/SHARE.zip?download=true).
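Conceptually, the masking step just keeps the embedding columns flagged as language units and writes them out for the regression. A minimal sketch of the idea (file names and array shapes are assumptions; see the script for its actual interface):
```python
# Minimal sketch of the language-unit masking idea (paths and shapes are
# illustrative; see NeuralAlignment/apply_langloc_mask.py for the real interface).
import numpy as np
import pandas as pd

mask = np.load("langloc_mask.npy").astype(bool)  # (hidden_dim,), 1 = language unit
embeddings = np.load("sentence_embeddings.npy")  # (n_sentences, hidden_dim)

masked = embeddings[:, mask]                     # keep only the language units
pd.DataFrame(masked).to_csv("masked_embeddings.csv", index=False)
```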
---
## LLM Training
Once your environment is ready, train the Generational Pruning GPT model from a pruned checkpoint with:
```bash
# Single-GPU debug run
python train.py \
--run_name my_experiment \
--train_data_dir path/to/train/*.bin \
--val_data_dir path/to/val/*.bin \
--wandb # (optional: log to Weights & Biases)
# Multi-GPU DDP run
torchrun --standalone --nproc_per_node=8 train.py \
--run_name my_experiment \
--train_data_dir path/to/train/*.bin \
--val_data_dir path/to/val/*.bin \
--per_device_batch_size 16 \
--batch_size 512 \
--wandb
```
**Key flags**:
- `--run_name`: name for output folder under `./out/` and (optionally) W&B run.
- `--train_data_dir` / `--val_data_dir`: glob pattern for `.bin` tokenized data.
- `--per_device_batch_size`: batch size per GPU.
- `--batch_size`: total batch size (split across GPUs, typically via gradient accumulation; see the sketch after the help command below).
- `--wandb`: enable logging to Weights & Biases.
- `--push_to_hf`: after training, upload final model to Hugging Face Hub under repo name `--run_name`.
All other flags (learning rate, scheduler, pruning init, etc.) can be viewed with:
```bash
python train.py --help
```
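As noted above, the two batch-size flags usually interact through gradient accumulation. The arithmetic below is an assumption about the common pattern, not a reading of `train.py`; check the script for the actual logic:
```python
# Hypothetical illustration of how the batch-size flags usually relate;
# verify against train.py for the actual computation.
batch_size = 512            # --batch_size (total sequences per optimizer step)
per_device_batch_size = 16  # --per_device_batch_size
world_size = 8              # GPUs launched under torchrun

assert batch_size % (per_device_batch_size * world_size) == 0
grad_accum_steps = batch_size // (per_device_batch_size * world_size)
print(grad_accum_steps)  # 4 micro-batches accumulated per optimizer step
```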
To run the pruning training itself, use:
```bash
python train_itp.py \
--run_name my_experiment \
--train_data_dir path/to/train/*.bin \
--val_data_dir path/to/val/*.bin \
--wandb # (optional: log to Weights & Biases)
```
This will save a checkpoint to `out/<my_experiment>`, which you can use as the connectome for the inner-loop training above.
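For intuition, the "connectome" produced by this stage is a binary mask over model weights. The sketch below shows generic one-shot magnitude pruning as a simplified illustration of how such a mask can be derived; `train_itp.py`'s actual iterative procedure may differ:
```python
# Simplified one-shot magnitude pruning; train_itp.py's actual iterative
# procedure may differ. Keeps the largest-magnitude (1 - sparsity) weights.
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask that keeps the top (1 - sparsity) fraction of |weight|."""
    k = int(weight.numel() * (1.0 - sparsity))          # weights to keep
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

w = torch.randn(256, 256)
mask = magnitude_mask(w, sparsity=0.9)  # keep the top 10% of weights
pruned_w = w * mask                     # the mask plays the role of a connectome
```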
---
## Citation
If you use this code, please cite:
> Kotar, K., & Tuckute, G. (2025). Model connectomes: A generational approach to data-efficient language models. *Second Workshop on Representational Alignment at ICLR 2025*.