MiguelBraganca committed on
Commit
1fbae5e
·
verified ·
1 Parent(s): 8290e34

Update README.md

Files changed (1)
  1. README.md +120 -39
README.md CHANGED
@@ -1,20 +1,20 @@
1
  # AbBFN2: A flexible antibody foundation model based on Bayesian Flow Networks
2
 
3
- Welcome to the inference code of AbBFN2, a state-of-the-art model for antibody sequence generation.
4
 
5
- ## Overview
6
 
7
- AbBFN2 is a generative antibody foundation model trained on a rich dataset of paired antibody sequences alongside their genetic and biophysical metadata. This allows for a unified modelling of diverse data sources and flexible conditional generation at inference time.
8
- AbBFN2 leverages the Bayesian Flow Network (BFN) framework, which models distributions over data rather than the data itself, making it suitable for both discrete (sequences, gene labels) and continuous (biophysical properties) data.
9
 
10
- At inference, AbBFN2 can concurrently generate all 45 data modes. By conditioning on arbitrary combinations of information, the model can handle a variety of tasks without task-specific training.
 
11
 
12
- ![AbBFN2 Overview](abbfn2_overview.png)
13
 
14
  ## Prerequisites
15
- - [Repo](https://github.com/instadeepai/AbBFN2) cloned on your machine.
16
  - Docker installed on your system
17
  - Sufficient computational resources (TPU/GPU recommended)
 
18
 
19
  ## Installation
20
 
@@ -26,27 +26,29 @@ ACCELERATOR = GPU # Options: CPU, TPU, or GPU
26
 
27
  Note: Multi-host inference is not supported in this release. Please use single-host settings only.
28
 
29
- ### Building the Docker Image - Windows/Linux
30
  Run the following command to build the AbBFN2 Docker image:
31
  ```bash
32
  make build
33
  ```
34
  This process typically takes 5-20 minutes depending on your hardware.
35
 
36
- ### Building the Virtual Envrionment - MacOS
37
- Run the following commands to build the AbBFN2 virtual environment:
 
38
  ```bash
39
  conda env create -f environment.yaml
 
40
  ```
41
- Please make sure you activate the environment before running the scripts.
42
-
43
 
44
  ## Usage
45
 
46
  AbBFN2 supports three main generation modes, each with its own configuration file in the `experiments/configs/` directory.
47
 
 
 
48
  ### 1. Unconditional Generation
49
- Generate novel antibody sequences without any constraints.
50
 
51
  Configuration (`unconditional.yaml`):
52
  ```yaml
@@ -60,11 +62,13 @@ cfg:
60
 
61
  Run:
62
  ```bash
63
- make unconditional
64
  ```
65
 
66
- ### 2. Inpainting
67
- Generate antibody sequences conditioned on specific CDR regions.
 
 
68
 
69
  Configuration (`inpaint.yaml`):
70
  ```yaml
@@ -77,9 +81,9 @@ cfg:
77
  h_cdr3_seq: ARDAGVPLDY
78
  sampling:
79
  inpaint_fn:
80
- num_steps: 300-1000 # Number of sampling steps (recommended: 300-1000)
81
  mask_fn:
82
- data_modes: # Specify which regidata modes to condition on
83
  - "h_cdr1_seq"
84
  - "h_cdr2_seq"
85
  - "h_cdr3_seq"
@@ -87,18 +91,24 @@ cfg:
87
 
88
  Run:
89
  ```bash
90
- make inpaint
91
  ```
92
 
93
  ### 3. Sequence Humanization
94
- Convert non-human antibody sequences into humanized versions.
 
 
 
 
95
 
96
  Configuration (`humanization.yaml`):
97
  ```yaml
98
  cfg:
99
  input:
100
- l_seq: "EVKLQQSGPGLVTPSQSLSITCTVSGFSLSDYGVHWVRQSPGQGLEWLGVIWAGGGTNYNSALMSRKSISKDNSKSQVFLKMNSLQADDTAVYYCARDKGYSYYYSMDYWGQGTSVTVSS"
101
- h_seq: "DIETLQSPASLAVSLGQRATISCRASESVEYYVTSLMQWYQQKPGQPPKLLIFAASNVESGVPARFSGSGSGTNFSLNIHPVDEDDVAMYFCQQSRKYVPYTFGGGTKLEIK"
 
 
102
  sampling:
103
  recycling_steps: 10 # Number of recycling steps (recommended: 5-12)
104
  inpaint_fn:
@@ -107,25 +117,96 @@ cfg:
107
 
108
  Run:
109
  ```bash
110
- make humanization
111
  ```
112
 
113
  ## Citation
114
  If you use AbBFN2 in your research, please cite our work:
115
-
116
- ```text
117
- @article {,
118
- author = {},
119
- title = {AbBFN2: A flexible antibody foundation model based on Bayesian Flow Networks},
120
- elocation-id = {},
121
- year = {},
122
- doi = {},
123
- publisher = {},
124
- URL = {},
125
- eprint = {},
126
- journal = {}
127
- }
128
  ```
129
-
130
- ## License
131
- [License information to be added]
 
1
  # AbBFN2: A flexible antibody foundation model based on Bayesian Flow Networks
2
 
3
+ AbBFN2 allows for flexible task adaptation by virtue of its ability to condition the generative process on an arbitrary subset of variables. Further, since AbBFN2 is based on the Bayesian Flow Network paradigm, it can jointly model both discrete and continuous variables. Using this architecture, we provide a rich syntax which can be used to interact with the model. Regardless of conditioning information, the model generates all 45 "data modes" at inference time and arbitrary conditioning can be used to define specific tasks.
4
 
5
+ ## Getting Started
6
 
7
+ You can interact with AbBFN2 via:
 
8
 
9
+ * **Web Application:** [https://abbfn2.labs.deepchain.bio/](https://abbfn2.labs.deepchain.bio/)
10
+ * **Open-Source Repository:** [https://github.com/instadeepai/AbBFN2](https://github.com/instadeepai/AbBFN2)
11
 
12
+ The instructions below pertain to the open-source repository.
13
 
14
  ## Prerequisites
 
15
  - Docker installed on your system
16
  - Sufficient computational resources (TPU/GPU recommended)
17
+ - Basic understanding of antibody structure and sequence notation
18
 
19
  ## Installation
20
 
 
26
 
27
  Note: Multi-host inference is not supported in this release. Please use single-host settings only.
28
 
29
+ ### Building the Docker Image
30
  Run the following command to build the AbBFN2 Docker image:
31
  ```bash
32
  make build
33
  ```
34
  This process typically takes 5-20 minutes depending on your hardware.
35
 
36
+
37
+ ### For Apple Silicon users
38
+ Instead of building the Docker image, create and activate the conda environment directly:
39
  ```bash
40
  conda env create -f environment.yaml
41
+ conda activate abbfn2
42
  ```
 
 
43
 
44
  ## Usage
45
 
46
  AbBFN2 supports three main generation modes, each with its own configuration file in the `experiments/configs/` directory.
47
 
48
+ In addition to the mode-specific settings, configuration files contain options for loading model weights. By default (`load_from_hf: true`), weights are downloaded from Hugging Face. Optionally, if you have the weights locally, set `load_from_hf: false` and provide the path in `model_weights_path` (e.g., `/app/params.pkl`).
49
+
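The weight-loading options described above can be sketched as a small check. This is a minimal sketch, not the repository's loader: the function name `resolve_weights_source` is hypothetical, and it assumes the two documented keys (`load_from_hf`, `model_weights_path`) are available in a flat dictionary, which may not match where they sit in the actual config files.

```python
from typing import Any, Mapping


def resolve_weights_source(cfg: Mapping[str, Any]) -> str:
    """Return "huggingface" or a local path, following the two documented options.

    Hypothetical helper: the real config layout may nest these keys differently.
    """
    # Documented default: weights are downloaded from Hugging Face.
    if cfg.get("load_from_hf", True):
        return "huggingface"
    # Otherwise a local path is required.
    path = cfg.get("model_weights_path")
    if not path:
        raise ValueError("load_from_hf is false, so model_weights_path must be set")
    return str(path)


print(resolve_weights_source({"load_from_hf": False, "model_weights_path": "/app/params.pkl"}))
```

Failing early on a missing path avoids discovering the problem only after the Docker container or conda environment has started an experiment.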
50
  ### 1. Unconditional Generation
51
+ Generate novel antibody sequences without any constraints. AbBFN2 will generate natural-like antibody sequences matching its training distribution. Note that the metadata labels are also predictions made by the model. For a discussion of the accuracy of these labels, please refer to the AbBFN2 manuscript.
52
 
53
  Configuration (`unconditional.yaml`):
54
  ```yaml
 
62
 
63
  Run:
64
  ```bash
65
+ make unconditional # or python experiments/unconditional.py for Apple Silicon users.
66
  ```
67
 
68
+ ### 2. Conditional Generation/Inpainting
69
+ Generate antibody sequences conditioned on specific attributes. Conditional generation highlights the flexibility of AbBFN2: the same model adapts to different tasks depending on the conditioning data supplied. While any combination is possible, conditional generation is primarily intended for conditioning on full sequences (referred to as sequence labelling in the manuscript), partial sequences (sequence inpainting), partial sequences and metadata (sequence design), or metadata only (conditional de novo generation). For categorical variables, the set of possible values is found in `src/abbfn2/data_mode_handler/oas_paired/constants.py`. For genes and CDR lengths, only values that appear at least 100 times in the training data are valid. When conditioning on species, the valid choices are human, mouse, and rat.
70
+
71
+ **Disclaimer**: _As discussed in the manuscript, the flexibility of AbBFN2 requires careful consideration of the exact combination of conditioning information for effective generation. For instance, conditioning on a kappa light chain locus V-gene together with a lambda locus J-gene family is unlikely to yield samples of high quality. Such paradoxical combinations can also exist in more subtle ways. Due to the space of possible conditioning information, we have only tested a small subset of such combinations._
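The kappa/lambda example in the disclaimer above is one contradiction that can be caught before sampling. The following is a minimal sketch under stated assumptions: the helper names are hypothetical and not part of the repository, and it assumes gene and family values use IMGT-style names beginning with `IGK` (kappa) or `IGL` (lambda). The actual value format is defined in `constants.py` and may differ. It only catches this one obvious case, not the more subtle combinations the disclaimer mentions.

```python
from typing import Dict, Mapping, Optional

# Light-chain gene and family data modes listed in this README.
LIGHT_CHAIN_FIELDS = ("lv_gene", "lj_gene", "lv_family", "lj_family")


def infer_locus(name: str) -> Optional[str]:
    """Map an IMGT-style name to "K" or "L"; None if the prefix is unrecognised."""
    if name.startswith("IGK"):
        return "K"
    if name.startswith("IGL"):
        return "L"
    return None


def light_locus_conflicts(conditioning: Mapping[str, str]) -> Dict[str, str]:
    """Return {field: implied locus} when the fields disagree, else an empty dict."""
    implied: Dict[str, str] = {}
    if "light_locus" in conditioning:
        implied["light_locus"] = conditioning["light_locus"]
    for field in LIGHT_CHAIN_FIELDS:
        value = conditioning.get(field)
        locus = infer_locus(value) if value else None
        if locus is not None:
            implied[field] = locus
    # More than one distinct locus means the conditioning is contradictory.
    return implied if len(set(implied.values())) > 1 else {}


# Example values are illustrative only.
print(light_locus_conflicts({"lv_gene": "IGKV1-39", "lj_family": "IGLJ2"}))
```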
72
 
73
  Configuration (`inpaint.yaml`):
74
  ```yaml
 
81
  h_cdr3_seq: ARDAGVPLDY
82
  sampling:
83
  inpaint_fn:
84
+ num_steps: 300 # Number of sampling steps (recommended: 300-1000)
85
  mask_fn:
86
+ data_modes: # Specify which data modes to condition on
87
  - "h_cdr1_seq"
88
  - "h_cdr2_seq"
89
  - "h_cdr3_seq"
 
91
 
92
  Run:
93
  ```bash
94
+ make inpaint # or python experiments/inpaint.py for Apple Silicon users.
95
  ```
96
 
97
  ### 3. Sequence Humanization
98
+ Convert non-human antibody sequences into humanized versions. This workflow is designed to run a sequence humanisation experiment given a paired, non-human starting sequence. AbBFN2 will be used to introduce mutations to the framework regions of the starting antibody, possibly using several recycling iterations. During sequence humanisation, appropriate human V-gene families to target will also be chosen, but can be manually set by the user too.
99
+
100
+ Briefly, the humanisation workflow uses the conditional generation capabilities of AbBFN2 in a sample recycling approach. Each iteration introduces further mutations, starting with a more aggressive strategy that is likely to introduce a larger number of mutations. As the sequence becomes more human under the model, fewer mutations are introduced at subsequent steps. In most cases, we have found that humanisation is achieved within a single recycling iteration. In the rare cases where the model changes the CDR loops, those changes are removed. For a detailed description of the humanisation workflow, please refer to the AbBFN2 manuscript.
101
+
102
+ While we provide the option to manually select V-gene families here, this workflow allows the model to select more appropriate V-gene families during inference, so the final V-gene families may differ from the initially selected ones. Also note that, because of the data AbBFN2 was trained on, humanisation will be most reliable when performed on murine or rat sequences. Sequences from other species have not been tested.
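Two steps in the description above, removing any CDR changes and tracking how many framework mutations were introduced, can be illustrated as follows. This is a sketch, not the repository's humanisation code: `restore_cdrs` and `count_mutations` are hypothetical helpers, the per-region dictionary layout is an assumption based on the data-mode field names in this README, and the sequences are toy values.

```python
from typing import Dict, Mapping

# CDR data modes for both chains, e.g. "h_cdr1_seq" ... "l_cdr3_seq".
CDR_FIELDS = [f"{chain}_cdr{i}_seq" for chain in ("h", "l") for i in (1, 2, 3)]


def restore_cdrs(original: Mapping[str, str], sample: Mapping[str, str]) -> Dict[str, str]:
    """Copy the original CDR regions over a generated sample, undoing CDR edits."""
    restored = dict(sample)
    for field in CDR_FIELDS:
        if field in original:
            restored[field] = original[field]
    return restored


def count_mutations(before: str, after: str) -> int:
    """Count substitutions between two equal-length regions (no indel handling)."""
    if len(before) != len(after):
        raise ValueError("regions must have equal length for a position-wise comparison")
    return sum(a != b for a, b in zip(before, after))


original = {"h_fwr1_seq": "QVQLQQ", "h_cdr3_seq": "ARDAGVPLDY"}
sample = {"h_fwr1_seq": "EVQLVQ", "h_cdr3_seq": "ARDAGVPLDF"}

cleaned = restore_cdrs(original, sample)
print(cleaned["h_cdr3_seq"], count_mutations(original["h_fwr1_seq"], cleaned["h_fwr1_seq"]))
```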
103
 
104
  Configuration (`humanization.yaml`):
105
  ```yaml
106
  cfg:
107
  input:
108
+ l_seq: "DIVLTQSPASLAVSLGQRATISCKASQSVDYDGHSYMNWYQQKPGQPPKLLIYAASNLESGIPARFSGSGSGTDFTLNIHPVEEEDAATYYCQQSDENPLTFGTGTKLELK"
109
+ h_seq: "QVQLQQSGPELVKPGALVKISCKASGYTFTSYDINWVKQRPGQGLEWIGWIYPGDGSIKYNEKFKGKATLTVDKSSSTAYMQVSSLTSENSAVYFCARRGEYGNYEGAMDYWGQGTTVTVSS"
110
+ # h_vfams: null # Optionally, set target v-gene families
111
+ # l_vfams: null
112
  sampling:
113
  recycling_steps: 10 # Number of recycling steps (recommended: 5-12)
114
  inpaint_fn:
 
117
 
118
  Run:
119
  ```bash
120
+ make humanization # or python experiments/humanization.py for Apple Silicon users.
121
  ```
122
 
123
+ ## Data Modes
124
+
125
+ The data modes supported by AbBFN2 are detailed below.
126
+
127
+ ##### Heavy-Chain IMGT Regions
128
+
129
+ | Field        | Type   | Region (IMGT) | Description                          | Length Range (AA) |
+ |--------------|--------|---------------|--------------------------------------|-------------------|
+ | `h_fwr1_seq` | string | FWR1          | Framework region 1                   | 18 – 41           |
+ | `h_fwr2_seq` | string | FWR2          | Framework region 2                   | 6 – 30            |
+ | `h_fwr3_seq` | string | FWR3          | Framework region 3                   | 29 – 58           |
+ | `h_fwr4_seq` | string | FWR4          | Framework region 4                   | 3 – 12            |
+ | `h_cdr1_seq` | string | CDR1          | Complementarity-determining region 1 | 1 – 22            |
+ | `h_cdr2_seq` | string | CDR2          | Complementarity-determining region 2 | 1 – 25            |
+ | `h_cdr3_seq` | string | CDR3          | Complementarity-determining region 3 | 2 – 58            |
138
+
139
+ ##### Light-Chain IMGT Regions
140
+
141
+ | Field        | Type   | Region (IMGT) | Description                          | Length Range (AA) |
+ |--------------|--------|---------------|--------------------------------------|-------------------|
+ | `l_fwr1_seq` | string | FWR1          | Framework region 1                   | 18 – 36           |
+ | `l_fwr2_seq` | string | FWR2          | Framework region 2                   | 11 – 27           |
+ | `l_fwr3_seq` | string | FWR3          | Framework region 3                   | 25 – 48           |
+ | `l_fwr4_seq` | string | FWR4          | Framework region 4                   | 3 – 13            |
+ | `l_cdr1_seq` | string | CDR1          | Complementarity-determining region 1 | 1 – 20            |
+ | `l_cdr2_seq` | string | CDR2          | Complementarity-determining region 2 | 1 – 16            |
+ | `l_cdr3_seq` | string | CDR3          | Complementarity-determining region 3 | 1 – 27            |
150
+
151
+ ##### CDR Length Metrics
152
+
153
+ Possible values are provided in [src/abbfn2/data_mode_handler/oas_paired/constants.py](src/abbfn2/data_mode_handler/oas_paired/constants.py).
154
+
155
+
156
+ | Field       | Type | Description               |
+ |-------------|------|---------------------------|
+ | `h1_length` | int  | CDR1 length (heavy chain) |
+ | `h2_length` | int  | CDR2 length (heavy chain) |
+ | `h3_length` | int  | CDR3 length (heavy chain) |
+ | `l1_length` | int  | CDR1 length (light chain) |
+ | `l2_length` | int  | CDR2 length (light chain) |
+ | `l3_length` | int  | CDR3 length (light chain) |
164
+
165
+ ##### Gene and Family Annotations
166
+
167
+ Possible values are provided in [src/abbfn2/data_mode_handler/oas_paired/constants.py](src/abbfn2/data_mode_handler/oas_paired/constants.py).
168
+
169
+ | Field         | Type   | Description                        |
+ |---------------|--------|------------------------------------|
+ | `hv_gene`     | string | V gene segment (heavy)             |
+ | `hd_gene`     | string | D gene segment (heavy)             |
+ | `hj_gene`     | string | J gene segment (heavy)             |
+ | `lv_gene`     | string | V gene segment (light)             |
+ | `lj_gene`     | string | J gene segment (light)             |
+ | `hv_family`   | string | V gene family (heavy)              |
+ | `hd_family`   | string | D gene family (heavy)              |
+ | `hj_family`   | string | J gene family (heavy)              |
+ | `lv_family`   | string | V gene family (light)              |
+ | `lj_family`   | string | J gene family (light)              |
+ | `species`     | string | One of “human”, “rat”, “mouse”     |
+ | `light_locus` | string | One of “K” (kappa) or “L” (lambda) |
183
+
184
+ ##### TAP Physicochemical Metrics
185
+
186
+ | Field             | Type   | Description                                     | Range                     |
+ |-------------------|--------|-------------------------------------------------|---------------------------|
+ | `tap_psh`         | float  | Patches of surface hydrophobicity (PSH)         | 72.0 – 300.0              |
+ | `tap_pnc`         | float  | Patches of negative charge (PNC)                | 0.0 – 10.0                |
+ | `tap_ppc`         | float  | Patches of positive charge (PPC)                | 0.0 – 7.5                 |
+ | `tap_sfvcsp`      | float  | Structural Fv charge symmetry parameter (SFvCSP)| -55.0 – 55.0              |
+ | `tap_psh_flag`    | string | PSH flag                                        | “red” / “amber” / “green” |
+ | `tap_pnc_flag`    | string | PNC flag                                        | “red” / “amber” / “green” |
+ | `tap_ppc_flag`    | string | PPC flag                                        | “red” / “amber” / “green” |
+ | `tap_sfvcsp_flag` | string | SFvCSP flag                                     | “red” / “amber” / “green” |
196
+
197
+
198
+ ##### V- and J- Identity Scores
199
+
200
+ | Field          | Type  | Description                    | Range (%)    |
+ |----------------|-------|--------------------------------|--------------|
+ | `h_v_identity` | float | Heavy-chain V segment identity | 64.0 – 100.0 |
+ | `h_d_identity` | float | Heavy-chain D segment identity | 74.0 – 100.0 |
+ | `h_j_identity` | float | Heavy-chain J segment identity | 74.0 – 100.0 |
+ | `l_v_identity` | float | Light-chain V segment identity | 66.0 – 100.0 |
+ | `l_j_identity` | float | Light-chain J segment identity | 77.0 – 100.0 |
207
+
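The ranges listed in the tables above can be turned into a simple pre-flight check on conditioning values. This is a sketch under stated assumptions: `validate_conditioning` is a hypothetical helper, not a repository function, and treating the documented ranges as hard bounds is an assumption (the model may handle out-of-range values differently). The numeric bounds are copied from the tables.

```python
from typing import Dict, List, Mapping, Tuple, Union

# Numeric ranges copied from the TAP and identity tables in this README.
CONTINUOUS_RANGES: Dict[str, Tuple[float, float]] = {
    "tap_psh": (72.0, 300.0), "tap_pnc": (0.0, 10.0),
    "tap_ppc": (0.0, 7.5), "tap_sfvcsp": (-55.0, 55.0),
    "h_v_identity": (64.0, 100.0), "h_d_identity": (74.0, 100.0),
    "h_j_identity": (74.0, 100.0), "l_v_identity": (66.0, 100.0),
    "l_j_identity": (77.0, 100.0),
}

# Region length ranges (amino acids) copied from the IMGT region tables.
REGION_LENGTHS: Dict[str, Tuple[int, int]] = {
    "h_fwr1_seq": (18, 41), "h_fwr2_seq": (6, 30), "h_fwr3_seq": (29, 58),
    "h_fwr4_seq": (3, 12), "h_cdr1_seq": (1, 22), "h_cdr2_seq": (1, 25),
    "h_cdr3_seq": (2, 58),
    "l_fwr1_seq": (18, 36), "l_fwr2_seq": (11, 27), "l_fwr3_seq": (25, 48),
    "l_fwr4_seq": (3, 13), "l_cdr1_seq": (1, 20), "l_cdr2_seq": (1, 16),
    "l_cdr3_seq": (1, 27),
}


def validate_conditioning(values: Mapping[str, Union[str, float]]) -> List[str]:
    """Return one message per value that falls outside its documented range."""
    problems: List[str] = []
    for field, value in values.items():
        if field in CONTINUOUS_RANGES:
            low, high = CONTINUOUS_RANGES[field]
            if not low <= float(value) <= high:
                problems.append(f"{field}={value} is outside [{low}, {high}]")
        elif field in REGION_LENGTHS:
            low, high = REGION_LENGTHS[field]
            if not low <= len(str(value)) <= high:
                problems.append(f"{field} has length {len(str(value))}, expected {low} to {high}")
    return problems


print(validate_conditioning({"h_cdr3_seq": "ARDAGVPLDY", "tap_ppc": 9.0}))
```

Fields not covered by either table (genes, families, flags, CDR length labels) are ignored here; their valid values live in `constants.py`.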
208
  ## Citation
209
  If you use AbBFN2 in your research, please cite our work:
 
210
  ```
211
+ [Citation information to be added]
212
+ ```