---
license: gpl-3.0
language:
- en
library_name: sklearn
---

Prediction of aerobicity (whether a bacterium or archaeon is aerobic) based on gene copy numbers. The task is posed as a two-class problem: each genome is predicted to be either aerobic or anaerobic.

This predictor was used in the following publication; please cite it if appropriate:

Davín, A.A., Woodcroft, B.J., Soo, R.M., Morel, B., Murali, R., Schrempf, D., Clark, J.W., Álvarez-Carretero, S., Boussau, B., Moody, E.R. and Szánthó, L.L., 2025. A geological timescale for bacterial evolution and oxygen adaptation. Science, 388(6742), p.eadp1853. https://doi.org/10.1126/science.adp1853

## Installation

First ensure you have installed git-lfs (including running `git lfs install`), as described at https://www.atlassian.com/git/tutorials/git-lfs#installing-git-lfs

Then clone this repository and fetch the large files from inside it:

```
git clone https://huggingface.co/wwood/aerobicity
cd aerobicity
git lfs fetch --all
git lfs pull
```

Then set up the conda environment:

```
mamba env create -p env -f env-apply.yml
conda activate ./env
```

Finally, download the eggNOG database. We use version 2.1.3, as specified in the `env-apply.yml` conda environment file, because this is what the predictor was trained on. The eggNOG database is large, so it is not included in the repository. To download it, run:

```
mkdir eggNOG
download_eggnog_data.py --data_dir ./eggNOG
```
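If the model file is tracked by Git LFS (an assumption based on the git-lfs requirement above), a clone made without LFS leaves a small text pointer in place of the real file. The following is a minimal Python sketch for spotting that case. It relies only on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`; the helper name `is_lfs_pointer` is hypothetical and not part of this repository.

```python
# First line of every Git LFS pointer file, as defined by the pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like an un-fetched Git LFS pointer file.

    Hypothetical helper for illustration; it only inspects the first bytes.
    """
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


# Example use, run from the repository root:
#     if is_lfs_pointer("XGBoost.model"):
#         print("Model not fetched yet; run `git lfs pull`.")
```

If the check returns `True` for `XGBoost.model`, re-running `git lfs pull` inside the repository should replace the pointer with the actual model.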

## Usage
To apply the predictor to a test genome, run:

```
./17_apply_to_proteome.py --protein-fasta data/RS_GCF_000515355.1_protein.faa --eggnog-data-dir eggNOG/ \
  --models XGBoost.model --output-predictions predictions.csv
```

The predictions are written to `predictions.csv`. In this file, a prediction of `0` corresponds to an anaerobic prediction, and `1` corresponds to an aerobic prediction.
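To make the `0`/`1` coding easier to read, the output can be post-processed with a few lines of Python. This is only a sketch: the column layout of `predictions.csv` is not documented here, so the column name `prediction` below is an assumption. Check the header of your own file and pass the real name.

```python
import csv

# Label coding stated above: 0 = anaerobic, 1 = aerobic.
LABELS = {0: "anaerobic", 1: "aerobic"}


def label_predictions(handle, column="prediction"):
    """Yield each CSV row as a dict with an added 'label' key.

    `column` is a hypothetical column name, not confirmed by this README.
    """
    for row in csv.DictReader(handle):
        # Accept "1" as well as "1.0" in case the value was written as a float.
        code = int(float(row[column].strip()))
        row["label"] = LABELS[code]
        yield row


# Example use:
#     with open("predictions.csv", newline="") as handle:
#         for row in label_predictions(handle):
#             print(row)
```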

To run on your own genome, provide its protein FASTA file (i.e. the result of running `prodigal` on it) in place of `data/RS_GCF_000515355.1_protein.faa` in the above command.
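For several genomes, the same command can be repeated once per protein FASTA file. The sketch below uses only the flags shown in the command above. The directory layout (a folder of `*.faa` files), the output naming, and the function names `build_command` and `run_all` are hypothetical. It assumes it is run from the repository root with the conda environment activated.

```python
import subprocess
from pathlib import Path


def build_command(protein_fasta, output_csv, eggnog_dir="eggNOG/", model="XGBoost.model"):
    """Build the argument list for one run of 17_apply_to_proteome.py."""
    return [
        "./17_apply_to_proteome.py",
        "--protein-fasta", str(protein_fasta),
        "--eggnog-data-dir", str(eggnog_dir),
        "--models", str(model),
        "--output-predictions", str(output_csv),
    ]


def run_all(fasta_dir, output_dir):
    """Run the predictor once per *.faa file in `fasta_dir` (assumed layout)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(fasta_dir).glob("*.faa")):
        out = output_dir / f"{fasta.stem}.predictions.csv"
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(build_command(fasta, out), check=True)


# Example use (hypothetical directories):
#     run_all("my_proteins/", "my_predictions/")
```

Writing one predictions file per genome keeps each run independent, so a failure on one input does not affect results already produced for the others.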