---
license: cc-by-nc-sa-4.0
widget:
- text: ATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCAATGCGTCTCA
tags:
- nodule
- promoters
- plant
datasets:
- lhui2010/plant-promoters-induced-in-nodules
---

# Nodule-AI: A Deep Learning Model for Nodule-Specific Gene Identification

## Model Description

This model is a fine-tuned version of the [zhangtaolab/plant-dnamamba-BPE](https://huggingface.co/zhangtaolab/plant-dnamamba-BPE) architecture, specialized for identifying nodule-specific genes based on promoter DNA sequences.
The base model was pretrained on plant genomic sequences using a Mamba-based architecture with a Byte Pair Encoding (BPE) tokenizer, which we have adapted to promoter analysis through targeted fine-tuning.

## How to Use

An NVIDIA GPU is required.

### Installation

```bash
conda create -n llms python=3.11
conda activate llms
pip install 'torch<2.4' 'mambapy<=1.2.0' 'transformers<4.46' 'causal-conv1d<=1.3' 'mamba-ssm<2'
```

A fresh install may take about 15 minutes.

### Basic Inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "lhui2010/nodule-AI"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)

# Prepare input (3000 bp promoter sequence)
promoter_sequence = "ATGCGTCTCA" * 300  # your promoter here

# Tokenize and predict
inputs = tokenizer(
    promoter_sequence,
    return_tensors="pt",
    max_length=3000,
    truncation=True,
    padding="max_length"
)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    nodule_prob = probs[0][1].item()

print(f"Probability of nodule-specific regulation: {nodule_prob:.4f}")
```

The output should look like:

```
Probability of nodule-specific regulation: 0.0021
```

### Calculation of Shapley scores

Per-token Shapley scores can be used to identify which promoter regions drive a nodule-specific prediction; a minimal sampling-based sketch is provided at the end of this card.

## Training Data

The model was fine-tuned on a large dataset of [plant promoter sequences of nodule-induced genes](https://huggingface.co/datasets/lhui2010/plant-promoters-induced-in-nodules) compiled from 14 plant genomes in the nitrogen-fixing clade:

| Data Category | Samples | Species Included |
|---------------|---------|------------------|
| Nodule-specific promoters | 175,365 | *Aeschynomene evenia*, *Alnus trabeculosa*, *Arachis hypogaea*, *Chamaecrista pumila*, *Coriaria nepalensis*, *Datisca glomerata*, *Elaeagnus umbellata*, *Glycine max*, *Hippophae rhamnoides*, *Lotus japonicus*, *Medicago truncatula*, *Mimosa pudica*, *Parasponia andersonii*, *Phaseolus vulgaris* |
| Non-nodule promoters | 170,912 | Matched background sets from the same species |

## Training Procedure

**Fine-tuning parameters**:

- **Epochs**: 5
- **Batch size**: 8
- **Learning rate**: 1e-5
- **Hardware**: 1 × Tesla V100 32 GB GPU

## Evaluation

Performance on the evaluation set (n = 43,285 sequences):

| Metric | Value |
|--------|-------|
| Accuracy | 0.90 |
| F1 score | 0.90 |
| Precision | 0.85 |
| Recall | 0.96 |
| Matthews correlation | 0.80 |

## Citation

---

*Model card last updated: July 12, 2025*
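The following is a minimal sketch of the sampling-based approach referenced under "Calculation of Shapley scores" above. It illustrates the idea rather than the exact attribution pipeline used for this model: the helper names (`nodule_probability`, `approximate_shapley`), the number of sampled permutations, and the use of the pad token to stand in for an "absent" token are all assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "lhui2010/nodule-AI"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)
model.eval()


def nodule_probability(input_ids):
    """Probability of the nodule-specific class for one tokenized sequence."""
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()


def approximate_shapley(sequence, n_permutations=20, seed=0):
    """Monte Carlo (permutation-sampling) estimate of per-token Shapley scores.

    Tokens are removed one at a time in a random order, and the drop in the
    nodule-class probability is credited to the removed token. Replacing a
    token with the pad token is assumed to approximate its absence; pick a
    different placeholder id if the tokenizer defines no pad token.
    """
    torch.manual_seed(seed)
    input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    n_tokens = input_ids.shape[1]
    absent_id = tokenizer.pad_token_id  # assumption: pad token = "feature absent"
    contributions = torch.zeros(n_tokens)

    for _ in range(n_permutations):
        masked = input_ids.clone()
        previous = nodule_probability(masked)
        for pos in torch.randperm(n_tokens):
            masked[0, pos] = absent_id
            current = nodule_probability(masked)
            contributions[pos] += previous - current
            previous = current

    return contributions / n_permutations


scores = approximate_shapley("ATGCGTCTCA" * 300)
print(scores[:10])  # contributions of the first ten tokens
```

Exact Shapley values require evaluating exponentially many coalitions, so the number of sampled permutations trades accuracy for runtime; dedicated attribution libraries (for example the `shap` package) or gradient-based approximations are usually faster in practice.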
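For reference, the sketch below shows how a fine-tuning run with the hyperparameters listed under Training Procedure could be set up. It is an illustration under stated assumptions, not the training script used to produce this model: the dataset split and column names (`train`, `sequence`, `label`), `num_labels=2`, and the use of mixed precision are guesses that should be checked against the dataset card.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

base_model = "zhangtaolab/plant-dnamamba-BPE"
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model, num_labels=2, trust_remote_code=True  # num_labels=2 is an assumption
)

# Split and column names ("train", "sequence", "label") are assumptions;
# check the dataset card for the actual schema.
dataset = load_dataset("lhui2010/plant-promoters-induced-in-nodules")


def tokenize(batch):
    return tokenizer(
        batch["sequence"], max_length=3000, truncation=True, padding="max_length"
    )


tokenized = dataset["train"].map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="nodule-ai-finetune",
    num_train_epochs=5,              # epochs listed above
    per_device_train_batch_size=8,   # batch size listed above
    learning_rate=1e-5,              # learning rate listed above
    fp16=True,                       # assumption: mixed precision on the V100
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized)
trainer.train()
```

With a single 32 GB V100, a per-device batch size of 8 is consistent with the hardware listed above; gradient accumulation is a natural extension if the tokenized inputs do not fit in memory.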