---
license: cc-by-nc-sa-4.0
widget:
- text: ATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCA
tags:
- nodule
- promoters
- plant
datasets:
- lhui2010/plant-promoters-induced-in-nodules
---
# Nodule-AI: A Deep Learning Model for Nodule-Specific Gene Identification
## Model Description
This model is a fine-tuned version of the [zhangtaolab/plant-dnamamba-BPE](https://huggingface.co/zhangtaolab/plant-dnamamba-BPE) architecture specialized for identifying nodule-specific genes based on promoter DNA sequences. The base model was pretrained on plant genomic sequences using a Mamba-based architecture with Byte Pair Encoding (BPE), which we've adapted for promoter analysis through targeted fine-tuning.
## How to Use
An NVIDIA GPU is required, since `mamba-ssm` and `causal-conv1d` rely on CUDA kernels.
### Installation
```bash
conda create -n llms python=3.11
conda activate llms
pip install 'torch<2.4' 'mambapy<=1.2.0' 'transformers<4.46' 'causal-conv1d<=1.3' 'mamba-ssm<2'
```
A fresh installation may take around 15 minutes.
### Basic Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "lhui2010/nodule-AI"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)

# Prepare input (3,000 bp promoter sequence)
promoter_sequence = "ATGCGTCTCA" * 300  # your promoter here

# Tokenize and predict
inputs = tokenizer(
    promoter_sequence,
    return_tensors="pt",
    max_length=3000,
    truncation=True,
    padding="max_length",
)
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
nodule_prob = probs[0][1].item()
print(f"Probability of nodule-specific regulation: {nodule_prob:.4f}")
```
The output should look like:
```
Probability of nodule-specific regulation: 0.0021
```
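The model scores one 3,000 bp promoter at a time. To screen longer genomic regions, one option is to slide a fixed-size window across the sequence and score each window with the snippet above. A minimal windowing helper is sketched below; the window and stride values are illustrative choices, not prescribed by the model:

```python
def sliding_windows(seq, window=3000, stride=1500):
    """Yield (start, subsequence) pairs of fixed-size windows over seq."""
    if len(seq) <= window:
        # Sequence fits in a single window
        yield 0, seq
        return
    for start in range(0, len(seq) - window + 1, stride):
        yield start, seq[start:start + window]

# Example: an 8,000 bp toy sequence yields windows starting at 0, 1500, 3000, 4500
windows = list(sliding_windows("ACGT" * 2000))
```

Each `(start, subsequence)` pair can then be tokenized and scored exactly as in the basic inference example, giving a per-window probability track along the region.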
### Calculation of Shapley scores
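To illustrate what a Shapley score measures, the sketch below computes exact Shapley values by enumerating all coalitions over a toy value function. In practice the value function would be the model's predicted nodule probability for a promoter with selected positions masked; the function names and toy setup here are illustrative and not part of the released code.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values by coalition enumeration (feasible only for small n)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        contribution = 0.0
        for r in range(n):
            for subset in combinations(others, r):
                coalition = frozenset(subset)
                # Shapley weight for a coalition of this size
                weight = factorial(len(coalition)) * factorial(n - len(coalition) - 1) / factorial(n)
                # Marginal contribution of p to this coalition
                contribution += weight * (value_fn(coalition | {p}) - value_fn(coalition))
        phi[p] = contribution
    return phi

# Toy stand-in for the model score: positions 0 and 2 must both be present
def toy_score(coalition):
    return 1.0 if {0, 2} <= coalition else 0.0

phi = shapley_values([0, 1, 2], toy_score)
print(phi)  # positions 0 and 2 share the credit equally; position 1 gets none
```

For a real 3,000 bp input, exact enumeration is infeasible, so sampling-based approximations (e.g. those provided by the `shap` library) would be used instead.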
## Training Data
The model was fine-tuned on a large dataset of [promoter sequences of nodule-induced genes](https://huggingface.co/datasets/lhui2010/plant-promoters-induced-in-nodules) compiled from 14 plant species in the nitrogen-fixing clade:
| Data Category | Samples | Species Included |
|---------------|---------|------------------|
| Nodule-specific promoters | 175,365 | *Aeschynomene evenia*, *Alnus trabeculosa*, *Arachis hypogaea*, *Chamaecrista pumila*, *Coriaria nepalensis*, *Datisca glomerata*, *Elaeagnus umbellata*, *Glycine max*, *Hippophae rhamnoides*, *Lotus japonicus*, *Medicago truncatula*, *Mimosa pudica*, *Parasponia andersonii*, *Phaseolus vulgaris* |
| Non-nodule promoters | 170,912 | Matching species background sets |
## Training Procedure
**Fine-tuning Parameters**:
- **Epochs**: 5
- **Batch size**: 8
- **Learning rate**: 1e-5
- **Hardware**: 1 × Tesla V100 32GB GPU
## Evaluation
Performance on the held-out evaluation set (n = 43,285 sequences):
| Metric | Value |
|--------|-------|
| Accuracy | 0.90 |
| F1 Score | 0.90 |
| Precision | 0.85 |
| Recall | 0.96 |
| Matthews correlation | 0.80 |
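The table reports standard binary-classification metrics, with class 1 taken as nodule-specific. For reference, they can be computed from predictions and labels as in this pure-Python sketch (the toy labels below are illustrative, not evaluation data):

```python
from math import sqrt

def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1, and MCC for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Toy example: 3 true positives, 1 false negative, 3 true negatives, 1 false positive
metrics = classification_metrics([1, 1, 1, 1, 0, 0, 0, 0],
                                 [1, 1, 1, 0, 0, 0, 0, 1])
print(metrics)
```

Note how the reported precision (0.85) and recall (0.96) combine into the reported F1 of 0.90 via the harmonic mean, consistent with the table.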
## Citation
---
*Model card last updated: July 12, 2025*