---
license: mit
base_model:
  - google-bert/bert-base-uncased
tags:
  - genomics
  - bioinformatics
  - DNA
  - sequence-classification
  - introns
  - exons
  - BERT
---

# Exons and Introns Classifier

A BERT model fine-tuned to **classify DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset covering 34,627 species.

---

## Architecture

- Base model: BERT-base-uncased
- Approach: Full-sequence classification
- Framework: PyTorch + Hugging Face Transformers

---

## Usage

You can use this model through its own custom pipeline:

```python
from transformers import pipeline

pipe = pipeline(
  task="bert-exon-intron-classification",
  model="GustavoHCruz/ExInBERT",
  trust_remote_code=True,
)

out = pipe(
  {
    "sequence": "GTAAGGAGGGGGATGAGGGGTCATATCTCTTCTCAGGGAAAGCAGGAGCCCTTCAGCAGGGTCAGGGCCCCTCATCTTCCCCTCCTTTCCCAG",
    "organism": "Homo sapiens",
    "gene": "HLA-B",
    "before": "CCGAAGCCCCTCAGCCTGAGATGGG",
    "after": "AGCCATCTTCCCAGTCCACCGTCCC",
  }
)

print(out) # INTRON
```

This model keeps the maximum context length of standard BERT (512 tokens), but it was trained on DNA sequences of up to 256 nucleotides. The additional context fields (`organism`, `gene`, `before`, `after`) were prepared for training under the following rules:

- Organism and gene names were truncated to 10 characters.
- The flanking sequences `before` and `after` were limited to 25 nucleotides each.

The pipeline applies the same rules: the nucleotide sequence, `organism`, `gene`, `before`, and `after` are automatically truncated if they exceed these limits.
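As a rough illustration of the limits above, the sketch below clips each field before it is passed on. The helper `truncate_inputs` and its constants are hypothetical names, not part of the pipeline, and it assumes that truncation keeps the leading characters of every field; the pipeline's actual behaviour (especially for the `before` flank) may differ.

```python
# Limits taken from the rules above; the helper itself is hypothetical.
MAX_SEQUENCE = 256  # nucleotides in the main sequence
MAX_NAME = 10       # characters for organism and gene names
MAX_FLANK = 25      # nucleotides for each flanking sequence


def truncate_inputs(sequence, organism="", gene="", before="", after=""):
    """Clip every field to its training-time limit.

    Assumption: the leading characters are kept in all cases.
    """
    return {
        "sequence": sequence[:MAX_SEQUENCE],
        "organism": organism[:MAX_NAME],
        "gene": gene[:MAX_NAME],
        "before": before[:MAX_FLANK],
        "after": after[:MAX_FLANK],
    }


clipped = truncate_inputs("ACGT" * 100, organism="Homo sapiens", gene="HLA-B")
print(len(clipped["sequence"]))  # 256
print(clipped["organism"])       # Homo sapie
```

Since the pipeline already truncates its inputs, a helper like this is only useful for previewing what the model will actually see.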

---

## Custom Usage Information

The model expects input in the following prompt format:

```
<|SEQUENCE|>[G][T][A][A]...<|ORGANISM|>Homo sapiens<|GENE|>HLA-B<|FLANK_BEFORE|>[C][C][G][A]...<|FLANK_AFTER|>[A][G][C][C]...
```

- `<|SEQUENCE|>`: Full DNA sequence. Maximum of 256 nucleotides.
- `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training).
- `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training).
- `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences. Maximum of 25 nucleotides.

The model predicts a class label: 0 (Exon) or 1 (Intron).
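The prompt format can be assembled with plain string operations. The following is a minimal sketch, not the pipeline's own code: `build_prompt` and `ID2LABEL` are hypothetical names, and it assumes that optional fields are simply left out when they are not provided and that the training-time length limits are applied before formatting.

```python
# Label ids as documented above.
ID2LABEL = {0: "EXON", 1: "INTRON"}


def bracket(nucleotides):
    """Wrap each nucleotide in square brackets: 'GT' -> '[G][T]'."""
    return "".join(f"[{n}]" for n in nucleotides)


def build_prompt(sequence, organism=None, gene=None, before=None, after=None):
    """Assemble the prompt string (hypothetical helper).

    Assumption: missing optional fields are omitted entirely.
    """
    parts = [f"<|SEQUENCE|>{bracket(sequence[:256])}"]
    if organism:
        parts.append(f"<|ORGANISM|>{organism[:10]}")
    if gene:
        parts.append(f"<|GENE|>{gene[:10]}")
    if before:
        parts.append(f"<|FLANK_BEFORE|>{bracket(before[:25])}")
    if after:
        parts.append(f"<|FLANK_AFTER|>{bracket(after[:25])}")
    return "".join(parts)


print(build_prompt("GTAA"))
# <|SEQUENCE|>[G][T][A][A]
print(build_prompt("GTAA", gene="HLA-B", before="CCGA", after="AGCC"))
# <|SEQUENCE|>[G][T][A][A]<|GENE|>HLA-B<|FLANK_BEFORE|>[C][C][G][A]<|FLANK_AFTER|>[A][G][C][C]
```

The first call shows the sequence-only case, which matches the note that the model can run without the optional context fields.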

---

## Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available as the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).

---

## Publications

- **Full Paper**  
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.  
  DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575).
- **Short Paper**  
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.  
  DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113).

---

## Training

- Trained on a system with 8x H100 GPUs.

---

## Metrics

**Average accuracy:** **0.9996**

| Class      | Precision | Recall | F1-Score |
| ---------- | --------- | ------ | -------- |
| **Intron** | 0.9994    | 0.9994 | 0.9994   |
| **Exon**   | 0.9997    | 0.9997 | 0.9997   |

---

### Notes

- Metrics were computed on the full test set, which was kept isolated from the training data.
- The classes follow a ratio of approximately two exons to one intron, which allows the scores to be interpreted directly.
- The model can operate on raw nucleotide sequences without additional biological features (e.g. organism, gene, before or after).
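The reported figures can be cross-checked against each other with a few lines of arithmetic. This is only a consistency check, not the authors' evaluation code: it treats the approximate 2:1 exon-to-intron ratio as exact and uses the identity that overall accuracy equals the class-share-weighted average of per-class recall.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the metrics table.
precision = {"intron": 0.9994, "exon": 0.9997}
recall = {"intron": 0.9994, "exon": 0.9997}

# Assumption: exactly two exons per intron in the test set.
share = {"exon": 2 / 3, "intron": 1 / 3}

f1_scores = {c: f1(precision[c], recall[c]) for c in precision}
accuracy = sum(share[c] * recall[c] for c in recall)

print({c: round(v, 4) for c, v in f1_scores.items()})
# {'intron': 0.9994, 'exon': 0.9997}
print(round(accuracy, 4))
# 0.9996
```

Under that assumption, the weighted recall reproduces the reported average accuracy of 0.9996.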

---

## GitHub Repository

The full code for **data processing, model training, and inference** is available on GitHub:  
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

You can find scripts for:

- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models