---
license: cc-by-nc-sa-4.0
widget:
- text: ATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCA
tags:
- nodule
- promoters
- plant
datasets:
- lhui2010/plant-promoters-induced-in-nodules
---

# Nodule-AI: A Deep Learning Model for Nodule-Specific Gene Identification

## Model Description

This model is a fine-tuned version of [zhangtaolab/plant-dnamamba-BPE](https://huggingface.co/zhangtaolab/plant-dnamamba-BPE), specialized for identifying nodule-specific genes from promoter DNA sequences. The base model was pretrained on plant genomic sequences using a Mamba architecture with Byte Pair Encoding (BPE) tokenization, which we adapted to promoter classification through targeted fine-tuning.

## How to Use

An NVIDIA GPU is required: the `mamba-ssm` and `causal-conv1d` packages rely on CUDA kernels and do not run on CPU.

### Installation

```bash
conda create -n llms python=3.11
conda activate llms
pip install 'torch<2.4' 'mambapy<=1.2.0' 'transformers<4.46' 'causal-conv1d<=1.3' 'mamba-ssm<2'
```

A fresh installation may take around 15 minutes, largely because `causal-conv1d` and `mamba-ssm` may need to compile their CUDA extensions.
### Basic Inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "lhui2010/nodule-AI"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)
model = model.to("cuda").eval()  # Mamba kernels require a CUDA device

# Prepare input (3000 bp promoter sequence)
promoter_sequence = "ATGCGTCTCA" * 300  # your promoter here

# Tokenize and predict
inputs = tokenizer(
    promoter_sequence,
    return_tensors="pt",
    max_length=3000,
    truncation=True,
    padding="max_length",
).to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
nodule_prob = probs[0][1].item()  # class 1 = nodule-specific

print(f"Probability of nodule-specific regulation: {nodule_prob:.4f}")
```

The output should look like this:

```
Probability of nodule-specific regulation: 0.0021
```
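
To screen many promoters rather than one, you can read them from a FASTA file and apply a decision threshold to the predicted probability. The minimal parser and the 0.5 cutoff below are illustrative additions, not part of the released model:

```python
def parse_fasta(text):
    """Minimal FASTA parser: returns {record_id: sequence} (no validation)."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]  # record id = first word of the header
            records[header] = []
        else:
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}

def call_label(prob, threshold=0.5):
    """Map the model's class-1 probability to a label; the 0.5 cutoff is an assumption."""
    return "nodule-specific" if prob >= threshold else "background"
```

Each parsed sequence can then be passed through the tokenizer and model exactly as in the single-sequence example above.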

### Calculation of Shapley scores
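
This section is not yet filled in. One plausible approach, sketched below under stated assumptions, uses the `shap` library's text masker; it assumes `shap` is installed (`pip install shap`) and that `model` and `tokenizer` are loaded on a CUDA device as in the inference example. The helper names are our own, not part of the released model:

```python
import numpy as np

def to_class1_probs(logits):
    """Numerically stable softmax over the last axis; return the class-1 column."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return (e / e.sum(axis=-1, keepdims=True))[:, 1]

def explain_promoters(model, tokenizer, sequences, device="cuda"):
    """Token-level Shapley attributions via shap's Text masker (assumed installed)."""
    import shap
    import torch

    def predict_fn(batch):
        # shap passes an array of strings; return one probability per string
        enc = tokenizer(list(batch), return_tensors="pt", padding=True,
                        truncation=True, max_length=3000).to(device)
        with torch.no_grad():
            logits = model(**enc).logits.cpu().numpy()
        return to_class1_probs(logits)

    explainer = shap.Explainer(predict_fn, shap.maskers.Text(tokenizer))
    return explainer(list(sequences))
```

Calling `explain_promoters(model, tokenizer, ["ATGCGTCTCA" * 300])` would return per-token contributions toward the nodule-specific class, which can highlight candidate cis-regulatory motifs.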

## Training Data

The model was fine-tuned on a large dataset of [promoter sequences of nodule-induced plant genes](https://huggingface.co/datasets/lhui2010/plant-promoters-induced-in-nodules) compiled from 14 plant genomes in the nitrogen-fixing clade:

| Data Category | Samples | Species Included |
|---------------|---------|------------------|
| Nodule-specific promoters | 175,365 | *Aeschynomene evenia*, *Alnus trabeculosa*, *Arachis hypogaea*, *Chamaecrista pumila*, *Coriaria nepalensis*, *Datisca glomerata*, *Elaeagnus umbellata*, *Glycine max*, *Hippophae rhamnoides*, *Lotus japonicus*, *Medicago truncatula*, *Mimosa pudica*, *Parasponia andersonii*, *Phaseolus vulgaris* |
| Non-nodule promoters | 170,912 | Matching species background sets |

## Training Procedure

**Fine-tuning Parameters**:

- **Epochs**: 5
- **Batch size**: 8
- **Learning rate**: 1e-5
- **Hardware**: 1 × Tesla V100 32 GB GPU
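
The model card does not include the exact training script. Assuming the standard `transformers` `Trainer` API was used, the listed hyperparameters would map onto a configuration sketch like the following, where `output_dir` is a placeholder of our own:

```python
from transformers import TrainingArguments

# Illustrative mapping of the hyperparameters above; not the authors' actual script.
training_args = TrainingArguments(
    output_dir="nodule-ai-finetune",   # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
)
```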

## Evaluation

Performance on the evaluation set (n = 43,285 sequences):

| Metric | Value |
|--------|-------|
| Accuracy | 0.90 |
| F1 Score | 0.90 |
| Precision | 0.85 |
| Recall | 0.96 |
| Matthews correlation | 0.80 |
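
For reference, metrics of this kind follow directly from the confusion matrix; the self-contained helper below shows the standard formulas (an illustrative sketch, not the authors' evaluation script):

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, and MCC for binary labels (1 = nodule-specific)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```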

## Citation

---

*Model card last updated: July 12, 2025*