lhui2010/nodule-AI

@@ -17,55 +17,6 @@ datasets:
 This model is a fine-tuned version of the [zhangtaolab/plant-dnamamba-BPE](https://huggingface.co/zhangtaolab/plant-dnamamba-BPE) architecture specialized for identifying nodule-specific genes based on promoter DNA sequences. The base model was pretrained on plant genomic sequences using a Mamba-based architecture with Byte Pair Encoding (BPE), which we've adapted for promoter analysis through targeted fine-tuning.
-**Key Features**:
-- Vocabulary size of 7,999 with specialized DNA tokenization
-- Optimized for promoter sequence analysis (typical input: 3000 bp upstream of the transcription start site, TSS)
-- Mamba architecture enabling efficient long-sequence processing
-- Classification head for nodule-specific gene identification
-## Intended Use
-This model is designed for **plant genomics researchers** studying root nodule symbiosis mechanisms. Specifically, it predicts whether a given promoter sequence regulates genes specifically induced in root nodules.
-Example applications:
-- Annotating novel plant genomes for nodule-related functions
-- Identifying regulatory motifs in nodule-specific promoters
-- Comparative analysis of promoter architectures across nodulating species
-## Training Data
-The model was fine-tuned on a large dataset of plant promoter sequences with nodule-induced expression patterns revealed through RNA-Seq:
-| Data Category | Samples | Species Included |
-|---------------|---------|------------------|
-| Nodule-specific promoters | 175,365 | *Aeschynomene evenia*, *Alnus trabeculosa*, *Arachis hypogaea*, *Chamaecrista pumila*, *Coriaria nepalensis*, *Datisca glomerata*, *Elaeagnus umbellata*, *Glycine max*, *Hippophae rhamnoides*, *Lotus japonicus*, *Medicago truncatula*, *Mimosa pudica*, *Parasponia andersonii*, *Phaseolus vulgaris* |
-| Non-nodule promoters | 170,912 | Matching species background sets |
-**Sequence characteristics**:
-- 3000 bp upstream of transcription start site (TSS)
-- Balanced positive/negative representation
-- Large-scale collection of nodulating species
-## Training Procedure
-**Fine-tuning Parameters**:
-- **Epochs**: 5
-- **Batch size**: 8
-- **Learning rate**: 1e-5
-- **Hardware**: 1 × Tesla V100 32GB GPU
-## Evaluation
-Performance on the evaluation set (n = 43,285 sequences):
-| Metric | Value |
-|--------|-------|
-| Accuracy | 0.90 |
-| F1 Score | 0.90 |
-| Precision | 0.85 |
-| Recall | 0.96 |
-| Matthews correlation | 0.80 |
 ## How to Use
 An NVIDIA GPU is required.
@@ -115,7 +66,39 @@ The output should be like
 Probability of nodule-specific regulation: 0.0021
 ```
-### Calculate Shapley scores
 ## Citation

 This model is a fine-tuned version of the [zhangtaolab/plant-dnamamba-BPE](https://huggingface.co/zhangtaolab/plant-dnamamba-BPE) architecture specialized for identifying nodule-specific genes based on promoter DNA sequences. The base model was pretrained on plant genomic sequences using a Mamba-based architecture with Byte Pair Encoding (BPE), which we've adapted for promoter analysis through targeted fine-tuning.
 ## How to Use
 An NVIDIA GPU is required.
 Probability of nodule-specific regulation: 0.0021
 ```
+### Calculation of Shapley scores
+## Training Data
+The model was fine-tuned on a large dataset of [plant promoter sequences from nodule-induced genes](https://huggingface.co/datasets/lhui2010/plant-promoters-induced-in-nodules), compiled from 14 plant genomes in the nitrogen-fixing clade:
+| Data Category | Samples | Species Included |
+|---------------|---------|------------------|
+| Nodule-specific promoters | 175,365 | *Aeschynomene evenia*, *Alnus trabeculosa*, *Arachis hypogaea*, *Chamaecrista pumila*, *Coriaria nepalensis*, *Datisca glomerata*, *Elaeagnus umbellata*, *Glycine max*, *Hippophae rhamnoides*, *Lotus japonicus*, *Medicago truncatula*, *Mimosa pudica*, *Parasponia andersonii*, *Phaseolus vulgaris* |
+| Non-nodule promoters | 170,912 | Matching species background sets |
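The class balance implied by the table can be checked with a few lines of arithmetic. The sketch below uses only the two sample counts listed above; the function name `class_fractions` is illustrative and not part of the model's code. The majority-class fraction is a rough floor against which a classifier's accuracy can be read, assuming the evaluation set has a similar balance, which the card does not state.

```python
# Sample counts copied from the Training Data table.
NODULE_SPECIFIC = 175_365
NON_NODULE = 170_912


def class_fractions(positives: int, negatives: int) -> tuple[float, float]:
    """Return the (positive, negative) share of the combined dataset."""
    total = positives + negatives
    return positives / total, negatives / total


total = NODULE_SPECIFIC + NON_NODULE
pos_frac, neg_frac = class_fractions(NODULE_SPECIFIC, NON_NODULE)

print(f"total promoters: {total}")
print(f"nodule-specific fraction: {pos_frac:.3f}")
# Accuracy of a trivial classifier that always predicts the larger class.
print(f"majority-class baseline: {max(pos_frac, neg_frac):.3f}")
```

With these counts the two classes are close to even, so accuracy is a reasonable summary metric here.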
+## Training Procedure
+**Fine-tuning Parameters**:
+- **Epochs**: 5
+- **Batch size**: 8
+- **Learning rate**: 1e-5
+- **Hardware**: 1 × Tesla V100 32GB GPU
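To get a feel for what these settings mean in optimizer steps, here is a minimal sketch. The `FineTuneConfig` class is hypothetical and only mirrors the three values listed above. The training-set size is a placeholder, since the card does not state the train/evaluation split, and the step count assumes a single GPU with no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the hyperparameters listed above."""
    epochs: int = 5
    batch_size: int = 8
    learning_rate: float = 1e-5

    def steps_per_epoch(self, n_train: int) -> int:
        # One optimizer step per batch; the last batch may be partial.
        return math.ceil(n_train / self.batch_size)

    def total_steps(self, n_train: int) -> int:
        return self.epochs * self.steps_per_epoch(n_train)


config = FineTuneConfig()

# ASSUMPTION: placeholder training-set size, not a figure from the model card.
N_TRAIN_ASSUMED = 300_000

print(f"steps per epoch: {config.steps_per_epoch(N_TRAIN_ASSUMED)}")
print(f"total steps:     {config.total_steps(N_TRAIN_ASSUMED)}")
```

Substituting the real number of training sequences for `N_TRAIN_ASSUMED` gives the actual schedule length, which is useful when choosing warmup or checkpoint intervals.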
+## Evaluation
+Performance on evaluation set (n=43285 sequences):
+| Metric | Value |
+|--------|-------|
+| Accuracy | 0.90 |
+| F1 Score | 0.90 |
+| Precision | 0.85 |
+| Recall | 0.96 |
+| Matthews correlation | 0.80 |
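As a consistency check, the F1 score can be recomputed from the precision and recall in the table. The sketch below also defines the Matthews correlation coefficient from confusion-matrix counts. The counts passed to it are illustrative values chosen to roughly match the reported precision and recall; they are not the model's actual confusion matrix, which the card does not publish.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """MCC from confusion-matrix counts; 0.0 if any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Precision and recall as reported in the Evaluation table.
reported_f1 = f1_score(precision=0.85, recall=0.96)
print(f"F1 from reported precision/recall: {reported_f1:.2f}")  # 0.90

# ASSUMPTION: illustrative counts (100 positives, 100 negatives),
# not the model's real confusion matrix.
example_mcc = matthews_corrcoef(tp=96, fp=17, fn=4, tn=83)
print(f"MCC for the illustrative counts: {example_mcc:.2f}")
```

The recomputed F1 agrees with the reported 0.90, so the precision, recall, and F1 rows are mutually consistent at two decimal places.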
 ## Citation