---
license: cc-by-nc-sa-4.0
widget:
- text: ATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAAT
GCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCA
tags:
- nodule
- promoters
- plant
datasets:
- lhui2010/plant-promoters-induced-in-nodules
---


# Nodule-AI: A Deep Learning Model for Nodule-Specific Gene Identification

## Model Description

This model is a fine-tuned version of the [zhangtaolab/plant-dnamamba-BPE](https://huggingface.co/zhangtaolab/plant-dnamamba-BPE) architecture specialized for identifying nodule-specific genes based on promoter DNA sequences. The base model was pretrained on plant genomic sequences using a Mamba-based architecture with Byte Pair Encoding (BPE), which we've adapted for promoter analysis through targeted fine-tuning.

## How to Use

An NVIDIA GPU is required: the `mamba-ssm` and `causal-conv1d` CUDA kernels do not run on CPU.

### Installation
```bash
conda create -n llms python=3.11
conda activate llms
pip install 'torch<2.4' 'mambapy<=1.2.0' 'transformers<4.46' 'causal-conv1d<=1.3' 'mamba-ssm<2'
```

A fresh installation may take roughly 15 minutes.

### Basic Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer, then move the model to the GPU
model_name = "lhui2010/nodule-AI"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)
model = model.to("cuda").eval()

# Prepare input (3,000 bp promoter sequence)
promoter_sequence = "ATGCGTCTCA" * 300  # your promoter here

# Tokenize and predict
inputs = tokenizer(
    promoter_sequence,
    return_tensors="pt",
    max_length=3000,
    truncation=True,
    padding="max_length"
).to("cuda")

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    nodule_prob = probs[0][1].item()  # probability of the nodule-specific class

print(f"Probability of nodule-specific regulation: {nodule_prob:.4f}")
```

The output should look like:

```
Probability of nodule-specific regulation: 0.0021
```

### Calculation of Shapley scores
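One way to attribute a prediction to regions of the promoter is to estimate Shapley values over fixed-size sequence windows, treating each window as a "player" and masking absent windows with N's. Below is a minimal, self-contained sketch of a Monte Carlo (permutation-sampling) estimator. The `toy_score` function is only a stand-in so the snippet runs without the model; in practice you would pass a `score_fn` that wraps the inference code above and returns the predicted nodule probability. This sketch is illustrative and not necessarily the exact procedure used by the authors.

```python
import random

def shapley_window_scores(sequence, score_fn, window=10, n_samples=50, seed=0):
    """Monte Carlo estimate of Shapley values for non-overlapping windows.

    Each `window`-bp segment is a player; absent players are masked with N's.
    `score_fn` maps a DNA string to a scalar (e.g. the model's nodule probability).
    """
    rng = random.Random(seed)
    n_windows = len(sequence) // window
    phi = [0.0] * n_windows

    def masked(present):
        parts = []
        for i in range(n_windows):
            seg = sequence[i * window:(i + 1) * window]
            parts.append(seg if i in present else "N" * window)
        return "".join(parts)

    for _ in range(n_samples):
        order = list(range(n_windows))
        rng.shuffle(order)           # random player order
        present = set()
        prev = score_fn(masked(present))
        for i in order:
            present.add(i)
            cur = score_fn(masked(present))
            phi[i] += (cur - prev) / n_samples  # marginal contribution of window i
            prev = cur
    return phi

def toy_score(seq):
    """Stand-in scorer: GC-dinucleotide fraction (replace with model inference)."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a + b == "GC") / max(len(seq) - 1, 1)

seq = "ATGCGTCTCA" * 6  # short example; real promoters are ~3,000 bp
scores = shapley_window_scores(seq, toy_score)
print([round(s, 3) for s in scores])
```

By construction the estimates satisfy the efficiency property: the attributions sum to the difference between the score of the full sequence and the score of the fully masked baseline.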



## Training Data

The model was fine-tuned on a large dataset of [plant promoter sequences of nodule-induced genes](https://huggingface.co/datasets/lhui2010/plant-promoters-induced-in-nodules) compiled from 14 plant genomes in the nitrogen-fixing clade:

| Data Category | Samples | Species Included |
|---------------|---------|------------------|
| Nodule-specific promoters | 175,365 | *Aeschynomene evenia*, *Alnus trabeculosa*, *Arachis hypogaea*, *Chamaecrista pumila*, *Coriaria nepalensis*, *Datisca glomerata*, *Elaeagnus umbellata*, *Glycine max*, *Hippophae rhamnoides*, *Lotus japonicus*, *Medicago truncatula*, *Mimosa pudica*, *Parasponia andersonii*, *Phaseolus vulgaris* |
| Non-nodule promoters | 170,912 | Matching species background sets |

## Training Procedure

**Fine-tuning Parameters**:
- **Epochs**: 5
- **Batch size**: 8
- **Learning rate**: 1e-5 
- **Hardware**: 1 × Tesla V100 32GB GPU
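These hyperparameters map directly onto `transformers.TrainingArguments`. The fragment below is an illustrative sketch only: `output_dir` and `fp16` are assumptions, and dataset preparation, tokenization, and the `Trainer` call are omitted.

```python
from transformers import TrainingArguments

# Hyperparameters from the table above; output_dir and fp16 are illustrative
# assumptions, not taken from the actual training run.
training_args = TrainingArguments(
    output_dir="nodule-ai-finetune",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    fp16=True,  # assumption: mixed precision on the V100
)
```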

## Evaluation

Performance on the evaluation set (n = 43,285 sequences):

| Metric | Value |
|--------|-------|
| Accuracy | 0.90 |
| F1 Score | 0.90 |
| Precision | 0.85 |
| Recall | 0.96 |
| Matthews correlation | 0.80 |
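As a quick consistency check, F1 is the harmonic mean of precision and recall, and the reported values agree:

```python
precision, recall = 0.85, 0.96

# F1 = harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
assert round(f1, 2) == 0.9  # matches the reported F1 of 0.90
```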


## Citation

---

*Model card last updated: July 12, 2025*