klemenk committed e8be55c (verified) · Parent(s): 8e19de1

Create README.md

Files changed (1): README.md (+156, -0)
# Model connectomes: A generational approach to data-efficient language models
_Second Workshop on Representational Alignment at ICLR 2025_

**By:** Klemen Kotar & Greta Tuckute

---

![Paper Figure](generational_connectome_fig.png)

---

## Released Models

We have released the following pretrained Generational Connectome GPT models on the Hugging Face Hub:

| Model | Description |
|-------|-------------|
| [TuKoResearch/ConnectomeGPT100M](https://huggingface.co/TuKoResearch/ConnectomeGPT100M/) | Generational Pruning GPT with learned connectome |
| [TuKoResearch/RandomConnectomeGPT100M](https://huggingface.co/TuKoResearch/RandomConnectomeGPT100M/) | Generational Pruning GPT with random connectome |
| [TuKoResearch/NoConnectomeGPT100M](https://huggingface.co/TuKoResearch/NoConnectomeGPT100M/) | Generational Pruning GPT without any connectome |

You can evaluate any of these models on downstream NLP benchmarks by specifying the `--model_name` flag in the evaluation scripts.
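
For a quick sanity check outside the provided scripts, the checkpoints can also be loaded directly from the Hub. The snippet below is a minimal sketch and assumes the repositories expose a standard causal-LM interface through `transformers` (the custom architecture may require `trust_remote_code=True`); the prompt is just an example:

```python
# Minimal loading sketch -- assumes a standard causal-LM interface;
# trust_remote_code=True may be needed for the custom architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # the eval scripts use the gpt2 tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "TuKoResearch/ConnectomeGPT100M", trust_remote_code=True
)
model.eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, sequence, vocab)
print(logits.shape)
```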

---

## Installation

1. **Clone the repo**
   ```bash
   git clone https://github.com/TuKoResearch/GenerationalConnectomes.git
   cd GenerationalConnectomes
   ```

2. **Create & activate a Conda environment**
   ```bash
   conda create -n GenerationalConnectomes python=3.11 -y
   conda activate GenerationalConnectomes
   ```

3. **Install PyTorch 2.6** (choose the CUDA build matching your setup; see pytorch.org for the right index URL)
   ```bash
   pip install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
   ```

4. **Install the remaining dependencies**
   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

---

## NLP Evaluations

We provide evaluation scripts for MMLU and HellaSwag inside `evals/`.
You can reproduce our evaluations by running them against the model checkpoints from the Hugging Face Hub:

1. **Run MMLU**:
   ```bash
   python evals/mmlu.py \
       --model_name TuKoResearch/ConnectomeGPT100M \
       --tokenizer_name gpt2 \
       --device cuda:0
   ```

2. **Run HellaSwag**:
   ```bash
   python evals/hellaswag.py \
       --model_name TuKoResearch/ConnectomeGPT100M \
       --tokenizer_name gpt2 \
       --device cuda:0
   ```

---

## Behavioral alignment

We use the Futrell2018 reading-time benchmark, which can be obtained from [brain-score language](https://github.com/brain-score/language) and loaded in any environment with `xarray` installed. The data can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/assy_Futrell2018.nc?download=true).

Once downloaded, place the Futrell2018 reading-time dataset (`assy_Futrell2018.nc`) in a directory called `data/`.
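
To inspect the benchmark, the NetCDF file can be opened with `xarray`. This is a minimal sketch and assumes the file loads as a standard NetCDF dataset; the exact variable and coordinate names depend on how the brain-score assembly was stored, so inspect the printout rather than relying on any names used here:

```python
# Minimal sketch: open the Futrell2018 assembly and inspect its structure.
# Assumes data/assy_Futrell2018.nc is a standard NetCDF file readable by xarray.
import xarray as xr

ds = xr.open_dataset("data/assy_Futrell2018.nc")
print(ds)            # dimensions, coordinates, and data variables
print(ds.data_vars)  # e.g. per-word reading times, depending on the assembly layout
```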

To run the surprisal evaluation script and compute the Pearson correlation between model surprisal and human reading times (for the final checkpoint), execute:

```bash
python surprisal_eval.py \
    --model_name TuKoResearch/ConnectomeGPT100M \
    --tokenizer_name gpt2 \
    --device cuda:0
```
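
The reported score is, at its core, a Pearson correlation between per-word surprisal and per-word reading times. The toy snippet below only illustrates that metric; the arrays are made up and are not the script's internals:

```python
# Toy illustration of the metric: Pearson r between surprisal and reading times.
# The arrays here are placeholders; surprisal_eval.py computes the real values.
import numpy as np
from scipy.stats import pearsonr

surprisal = np.array([3.2, 7.8, 1.1, 5.6, 4.3])                # model surprisal per word
reading_time = np.array([210.0, 340.0, 180.0, 290.0, 250.0])   # human reading times (ms)

r, p = pearsonr(surprisal, reading_time)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```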

---

## Neural alignment

We use the Tuckute2024 neural benchmark, which can be downloaded from the following [public repository](https://github.com/gretatuckute/drive_suppress_brains) or from [brain-score language](https://github.com/brain-score/language). The cross-validated neural predictivity score can be computed with [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py) and looped across layers and models with [NeuralAlignment/loop_fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/loop_fit_mapping.py).
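
The mapping in `fit_mapping.py` fits a regression from model activations to neural responses and scores it with cross-validation. The sketch below is not that script; it is a minimal illustration of the general recipe (cross-validated ridge regression from an embedding matrix to neural responses), with made-up shapes and variable names:

```python
# Minimal sketch of cross-validated neural predictivity:
# ridge regression from model embeddings (n_sentences x n_units)
# to neural responses (n_sentences x n_voxels), scored by Pearson r per voxel.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 768))   # model embeddings (placeholder)
Y = rng.standard_normal((200, 50))    # neural responses (placeholder)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    reg = Ridge(alpha=1.0).fit(X[train_idx], Y[train_idx])
    pred = reg.predict(X[test_idx])
    # Pearson r per voxel between predicted and observed responses
    r = [np.corrcoef(pred[:, v], Y[test_idx][:, v])[0, 1] for v in range(Y.shape[1])]
    scores.append(np.mean(r))

print(f"Mean cross-validated predictivity: {np.mean(scores):.3f}")
```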

In some of the analyses, we first localize the LLM language units, per the approach established in AlKhamissi et al., 2025 (_ACL_), using code from the [following repository](https://github.com/BKHMSI/llm-localization). We adapted this code (POINTER??) to output a binary mask that marks the LLM language units as 1. The [NeuralAlignment/apply_langloc_mask.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/apply_langloc_mask.py) script takes the numpy binary mask for a given model and saves the masked embedding values as a CSV file, which can then serve as the input to [NeuralAlignment/fit_mapping.py](https://github.com/TuKoResearch/GenerationalConnectomes/blob/main/NeuralAlignment/fit_mapping.py).
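
For reference, the masking step amounts to zeroing out the units not flagged by the localizer and writing the result to CSV. This is a hypothetical sketch of that operation, not the contents of `apply_langloc_mask.py`; the file names and array shapes are assumptions:

```python
# Hypothetical sketch: apply a binary language-localizer mask to unit activations
# and save the masked values as CSV (mirrors what apply_langloc_mask.py is described to do).
import numpy as np
import pandas as pd

mask = np.load("langloc_mask.npy")              # binary mask, shape (n_units,), 1 = language unit
activations = np.load("model_activations.npy")  # shape (n_sentences, n_units), assumed layout

masked = activations * mask                     # zero out non-language units
pd.DataFrame(masked).to_csv("masked_embeddings.csv", index=False)
```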

The regression outputs can be downloaded [here](https://huggingface.co/datasets/TuKoResearch/GenerationalConnectomesData/resolve/main/SHARE.zip?download=true).

---

## LLM Training

Once your environment is ready, train the Generational Pruning GPT model from a pruned checkpoint with:

```bash
# Single-GPU debug run
python train.py \
    --run_name my_experiment \
    --train_data_dir "path/to/train/*.bin" \
    --val_data_dir "path/to/val/*.bin" \
    --wandb  # (optional: log to Weights & Biases)

# Multi-GPU DDP run
torchrun --standalone --nproc_per_node=8 train.py \
    --run_name my_experiment \
    --train_data_dir "path/to/train/*.bin" \
    --val_data_dir "path/to/val/*.bin" \
    --per_device_batch_size 16 \
    --batch_size 512 \
    --wandb
```

**Key flags**:
- `--run_name`: name for the output folder under `./out/` and (optionally) the W&B run.
- `--train_data_dir` / `--val_data_dir`: glob patterns for `.bin` tokenized data.
- `--per_device_batch_size`: batch size per GPU.
- `--batch_size`: total batch size (will be split across GPUs).
- `--wandb`: enable logging to Weights & Biases.
- `--push_to_hf`: after training, upload the final model to the Hugging Face Hub under the repo name given by `--run_name`.

All other flags (learning rate, scheduler, pruning init, etc.) can be viewed with:

```bash
python train.py --help
```

To run the pruning training (which produces the connectome checkpoint), run:

```bash
python train_itp.py \
    --run_name my_experiment \
    --train_data_dir "path/to/train/*.bin" \
    --val_data_dir "path/to/val/*.bin" \
    --wandb  # (optional: log to Weights & Biases)
```

This will save a checkpoint to `out/<my_experiment>`, which you can use as the connectome for the inner-loop training above.
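
Conceptually, the connectome is a sparsity pattern over the pruned model's weights. If you want to inspect such a checkpoint as a standalone binary mask, a hedged sketch is below; the checkpoint path, state-dict layout, and the assumption that pruned weights are stored as exact zeros are all hypothetical and may not match what `train_itp.py` actually saves:

```python
# Hypothetical sketch: derive a binary connectome mask (1 = kept weight) from a
# pruned checkpoint by treating exactly-zero weights as pruned. The file name and
# state-dict layout are assumptions, not necessarily what train_itp.py writes.
import torch

ckpt = torch.load("out/my_experiment/ckpt.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"

connectome = {
    name: (w != 0).to(torch.uint8)
    for name, w in state_dict.items()
    if torch.is_tensor(w) and w.dtype.is_floating_point
}
kept = sum(m.sum().item() for m in connectome.values())
total = sum(m.numel() for m in connectome.values())
print(f"Overall sparsity of the connectome: {1 - kept / total:.2%}")
```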

---

## Citation

If you use this code, please cite:

> Kotar, K., & Tuckute, G. (2025). Model connectomes: A generational approach to data-efficient language models. *Second Workshop on Representational Alignment at ICLR 2025*.