lhui2010 committed on
Commit 4ced258 · verified · 1 Parent(s): 8b205fa

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +33 -50
README.md CHANGED
@@ -17,55 +17,6 @@ datasets:
17
 
18
  This model is a fine-tuned version of the [zhangtaolab/plant-dnamamba-BPE](https://huggingface.co/zhangtaolab/plant-dnamamba-BPE) architecture specialized for identifying nodule-specific genes based on promoter DNA sequences. The base model was pretrained on plant genomic sequences using a Mamba-based architecture with Byte Pair Encoding (BPE), which we've adapted for promoter analysis through targeted fine-tuning.
19
 
20
- **Key Features**:
21
- - 7999 vocabulary size with specialized DNA tokenization
22
- - Optimized for promoter sequence analysis (typical input:3000 bp upstream of TSS)
23
- - Mamba architecture enabling efficient long-sequence processing
24
- - Classification head for nodule-specific gene identification
25
-
26
- ## Intended Use
27
-
28
- This model is designed for **plant genomics researchers** studying root nodule symbiosis mechanisms. Specifically, it predicts whether a given promoter sequence regulates genes specifically induced in root nodules.
29
-
30
- Example applications:
31
- - Annotating novel plant genomes for nodule-related functions
32
- - Identifying regulatory motifs in nodule-specific promoters
33
- - Comparative analysis of promoter architectures across nodulating species
34
-
35
- ## Training Data
36
-
37
- The model was fine-tuned on a large dataset of plant promoter sequences with nodule-induced expression patterns revealed through RNA-Seq:
38
-
39
- | Data Category | Samples | Species Included |
40
- |---------------|---------|------------------|
41
- | Nodule-specific promoters | 175,365 | *Aeschynomene evenia*, *Alnus trabeculosa*, *Arachis hypogaea*, *Chamaecrista pumila*, *Coriaria nepalensis*, *Datisca glomerata*, *Elaeagnus umbellata*, *Glycine max*, *Hippophae rhamnoides*, *Lotus japonicus*, *Medicago truncatula*, *Mimosa pudica*, *Parasponia andersonii*, *Phaseolus vulgaris* |
42
- | Non-nodule promoters | 170,912 | Matching species background sets |
43
-
44
- **Sequence characteristics**:
45
- - 3000 bp upstream of transcription start site (TSS)
46
- - Balanced positive/negative representation
47
- - Large scale collection of nodulating species
48
-
49
- ## Training Procedure
50
-
51
- **Fine-tuning Parameters**:
52
- - **Epochs**: 5
53
- - **Batch size**: 8
54
- - **Learning rate**: 1e-5
55
- - **Hardware**: 1 × Tesla V100 32GB GPU
56
-
57
- ## Evaluation
58
-
59
- Performance on evaluation set (n=43285 sequences):
60
-
61
- | Metric | Value |
62
- |--------|-------|
63
- | Accuracy | 0.90 |
64
- | F1 Score | 0.90 |
65
- | Precision | 0.85 |
66
- | Recall | 0.96 |
67
- | Matthews correlation | 0.80 |
68
-
69
  ## How to Use
70
 
71
  NVIDIA GPU is required
@@ -115,7 +66,39 @@ The output should be like
115
  Probability of nodule-specific regulation: 0.0021
116
  ```
117
 
118
- ### Calculate Shapley scores
119
 
120
  ## Citation
121
 
 
17
 
18
  This model is a fine-tuned version of the [zhangtaolab/plant-dnamamba-BPE](https://huggingface.co/zhangtaolab/plant-dnamamba-BPE) architecture specialized for identifying nodule-specific genes based on promoter DNA sequences. The base model was pretrained on plant genomic sequences using a Mamba-based architecture with Byte Pair Encoding (BPE), which we've adapted for promoter analysis through targeted fine-tuning.
19
 
20
  ## How to Use
21
 
22
  NVIDIA GPU is required
 
66
  Probability of nodule-specific regulation: 0.0021
67
  ```
68
 
69
+ ### Calculation of Shapley scores
70
+
71
+
72
+
73
+ ## Training Data
74
+
75
+ The model was fine-tuned on a large dataset of [plant promoter sequences with nodule-induced genes](https://huggingface.co/datasets/lhui2010/plant-promoters-induced-in-nodules) compiled from 14 plant genomes from the nitrogen-fixing clade:
76
+
77
+ | Data Category | Samples | Species Included |
78
+ |---------------|---------|------------------|
79
+ | Nodule-specific promoters | 175,365 | *Aeschynomene evenia*, *Alnus trabeculosa*, *Arachis hypogaea*, *Chamaecrista pumila*, *Coriaria nepalensis*, *Datisca glomerata*, *Elaeagnus umbellata*, *Glycine max*, *Hippophae rhamnoides*, *Lotus japonicus*, *Medicago truncatula*, *Mimosa pudica*, *Parasponia andersonii*, *Phaseolus vulgaris* |
80
+ | Non-nodule promoters | 170,912 | Matching species background sets |
81
+
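As a quick sanity check on the sample counts in the table above, the class balance can be computed directly. This is only an arithmetic sketch; the variable names are illustrative and not part of the model's code.

```python
# Sample counts taken from the Training Data table.
n_positive = 175_365  # nodule-specific promoters
n_negative = 170_912  # non-nodule promoters

total = n_positive + n_negative
positive_fraction = n_positive / total

# The two classes are close to balanced (roughly 51% positive).
print(f"total = {total}, positive fraction = {positive_fraction:.3f}")
```

With near-equal class sizes, accuracy is a reasonable headline metric, although precision, recall, and the Matthews correlation remain more informative about the two error types.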
82
+ ## Training Procedure
83
+
84
+ **Fine-tuning Parameters**:
85
+ - **Epochs**: 5
86
+ - **Batch size**: 8
87
+ - **Learning rate**: 1e-5
88
+ - **Hardware**: 1 × Tesla V100 32GB GPU
89
+
90
+ ## Evaluation
91
+
92
+ Performance on evaluation set (n=43285 sequences):
93
+
94
+ | Metric | Value |
95
+ |--------|-------|
96
+ | Accuracy | 0.90 |
97
+ | F1 Score | 0.90 |
98
+ | Precision | 0.85 |
99
+ | Recall | 0.96 |
100
+ | Matthews correlation | 0.80 |
101
+
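The reported F1 score can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The snippet below is a minimal sketch using only the rounded values from the table; the function name is illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values from the Evaluation table.
precision, recall = 0.85, 0.96
f1 = f1_from_precision_recall(precision, recall)

print(f"F1 = {f1:.2f}")  # agrees with the 0.90 reported in the table
```

The higher recall than precision suggests the classifier errs toward calling promoters nodule-specific, so false positives are more likely than false negatives.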
102
 
103
  ## Citation
104