---
license: mit
base_model:
  - zhihan1996/DNABERT-2-117M
tags:
  - genomics
  - bioinformatics
  - DNA
  - sequence-classification
  - introns
  - exons
  - DNABERT2
---

# Exons and Introns Classifier

A DNABERT2 model fine-tuned to **classify DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset covering 34,627 species.

---

## Architecture

- Base model: DNABERT2
- Approach: Full-sequence classification

---

## Usage

You can use this model through its own custom pipeline:

```python
from transformers import pipeline

pipe = pipeline(
  task="dnabert2-exon-intron-classification",
  model="GustavoHCruz/ExInDNABERT2",
  trust_remote_code=True,
)

out = pipe(
  "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG"
)

print(out) # EXON
```

This model uses the same maximum context length as the standard DNABERT2 (512 tokens), but it was trained on DNA sequences of up to 256 nucleotides.

The pipeline will automatically truncate nucleotide sequences that exceed this limit.
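
Because the model was trained on sequences of up to 256 nucleotides, longer inputs may be better served by splitting them into 256-nt windows than by relying on truncation. A minimal sketch (the `classify_windows` helper is illustrative, not part of this repository):

```python
from transformers import pipeline

pipe = pipeline(
  task="dnabert2-exon-intron-classification",
  model="GustavoHCruz/ExInDNABERT2",
  trust_remote_code=True,
)

def classify_windows(sequence: str, window: int = 256) -> list:
  # Illustrative helper: split a long sequence into fixed-size windows
  # and classify each window independently.
  chunks = [sequence[i:i + window] for i in range(0, len(sequence), window)]
  return [pipe(chunk) for chunk in chunks]

# Hypothetical long input, built by repetition for the example.
long_seq = "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG" * 10
print(classify_windows(long_seq))
```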

---

## Custom Usage Information

The model expects the same tokens as DNABERT2, i.e., raw nucleotide input such as:

```
GTAAGGAGGGGGAT
```

The model predicts one of two class labels: 0 (Intron) or 1 (Exon).
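
For lower-level control, the checkpoint can also be loaded with the standard `transformers` classes. A minimal sketch, assuming the checkpoint exposes a regular sequence-classification head (the index-to-name mapping follows the labels above):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "GustavoHCruz/ExInDNABERT2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
  model_id, trust_remote_code=True
)

inputs = tokenizer("GTAAGGAGGGGGAT", return_tensors="pt")
with torch.no_grad():
  logits = model(**inputs).logits

# 0 = Intron, 1 = Exon, as documented above.
label = ["INTRON", "EXON"][logits.argmax(dim=-1).item()]
print(label)
```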

---

## Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available at the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
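
The dataset can be pulled with the `datasets` library. A minimal sketch (the split name is an assumption; check the dataset card for the actual schema):

```python
from datasets import load_dataset

# The "train" split is an assumption; see the dataset card for details.
ds = load_dataset("GustavoHCruz/DNA_coding_regions", split="train")
print(ds[0])
```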

---

## Publications

- **Full Paper**  
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.  
  DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575).
- **Short Paper**  
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.  
  DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113).

---

## Training

- Trained on a machine with 8× NVIDIA H100 GPUs.

---

## Metrics

**Average accuracy:** **0.9956**

| Class      | Precision | Recall | F1-Score |
| ---------- | --------- | ------ | -------- |
| **Intron** | 0.9943    | 0.9922 | 0.9932   |
| **Exon**   | 0.9962    | 0.9972 | 0.9967   |

### Notes

- Metrics were computed on a fully held-out test set.
- The classes follow a ratio of approximately two exons to one intron, so the per-class scores can be interpreted directly (see the sketch below).
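
For reference, per-class precision, recall, and F1 scores of this kind can be computed with scikit-learn's `classification_report`. A minimal sketch with toy labels (not the actual evaluation code from this project):

```python
from sklearn.metrics import classification_report

# Toy ground-truth and predicted labels, for illustration only.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 0, 1, 0]

# 0 = Intron, 1 = Exon, matching the label convention above.
print(classification_report(y_true, y_pred, target_names=["Intron", "Exon"]))
```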

---

## GitHub Repository

The full code for **data processing, model training, and inference** is available on GitHub:  
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

You can find scripts for:

- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models