---
license: mit
base_model:
  - google-bert/bert-base-uncased
tags:
  - genomics
  - bioinformatics
  - DNA
  - sequence-classification
  - introns
  - exons
  - BERT
---

# Exons and Introns Classifier

A BERT model fine-tuned to **classify DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset (34,627 species).

---

## Architecture

- Base model: BERT-base-uncased
- Approach: Full-sequence classification
- Framework: PyTorch + Hugging Face Transformers

---

## Usage

You can use this model through its own custom pipeline:

```python
from transformers import pipeline

pipe = pipeline(
  task="bert-exon-intron-classification",
  model="GustavoHCruz/ExInBERT",
  trust_remote_code=True,
)

out = pipe(
  {
    "sequence": "GTAAGGAGGGGGATGAGGGGTCATATCTCTTCTCAGGGAAAGCAGGAGCCCTTCAGCAGGGTCAGGGCCCCTCATCTTCCCCTCCTTTCCCAG",
    "organism": "Homo sapiens",
    "gene": "HLA-B",
    "before": "CCGAAGCCCCTCAGCCTGAGATGGG",
    "after": "AGCCATCTTCCCAGTCCACCGTCCC",
  }
)

print(out) # INTRON
```

This model keeps the maximum context length of standard BERT (512 tokens), but it was trained on DNA sequences of up to 256 nucleotides. The additional context fields (`organism`, `gene`, `before`, `after`) were limited during training as follows:

- Organism and gene names were truncated to 10 characters.
- The flanking sequences `before` and `after` were at most 25 nucleotides long.

The pipeline applies the same rules: the nucleotide sequence, `organism`, `gene`, `before`, and `after` are automatically truncated if they exceed these limits.
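The limits above can be illustrated with a minimal sketch. The helper `truncate_inputs` and its constants are hypothetical names, not part of the published pipeline, and the sketch assumes that every field keeps its leading characters; the actual pipeline may clip differently (for example, keeping the end of `before`).

```python
# Limits taken from the model card; the helper itself is hypothetical.
MAX_SEQUENCE = 256  # nucleotides in the main sequence
MAX_NAME = 10       # characters, for organism and gene names
MAX_FLANK = 25      # nucleotides, for the before/after flanks


def truncate_inputs(sequence, organism="", gene="", before="", after=""):
    """Clip each field to its training-time limit.

    Assumption: every field keeps its leading characters. The real
    pipeline may clip differently.
    """
    return {
        "sequence": sequence[:MAX_SEQUENCE],
        "organism": organism[:MAX_NAME],
        "gene": gene[:MAX_NAME],
        "before": before[:MAX_FLANK],
        "after": after[:MAX_FLANK],
    }


example = truncate_inputs(
    sequence="ACGT" * 100,    # 400 nucleotides, over the limit
    organism="Homo sapiens",  # 12 characters, over the limit
    gene="HLA-B",
    before="C" * 30,          # over the limit
    after="A" * 10,
)

# Show the resulting length of each field.
print({name: len(value) for name, value in example.items()})
```

Fields that are already within their limits (here `gene` and `after`) pass through unchanged.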

---

## Custom Usage Information

The model expects input in the following prompt format:

```
<|SEQUENCE|>[G][T][A][A]...<|ORGANISM|>Homo sapiens<|GENE|>HLA-B<|FLANK_BEFORE|>[C][C][G][A]...<|FLANK_AFTER|>[A][G][C][C]...
```

- `<|SEQUENCE|>`: Full DNA sequence. Maximum of 256 nucleotides.
- `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training).
- `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training).
- `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences. Maximum of 25 nucleotides.

The model predicts a class label: 0 (Exon) or 1 (Intron).
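As a rough illustration of the prompt format, the sketch below assembles the input string from its parts. The function `build_prompt` and the `ID2LABEL` mapping are hypothetical names introduced here, and the sketch assumes that the tag of an optional field is omitted entirely when the field is not provided; the repository code may handle missing fields differently.

```python
def build_prompt(sequence, organism=None, gene=None, before=None, after=None):
    """Assemble the prompt string described above (hypothetical helper).

    Assumption: the tag of an optional field is left out when the
    field is missing or empty.
    """

    def as_tokens(nucleotides):
        # Wrap each nucleotide in brackets: "GT" -> "[G][T]".
        return "".join(f"[{n}]" for n in nucleotides)

    parts = ["<|SEQUENCE|>" + as_tokens(sequence)]
    if organism:
        parts.append("<|ORGANISM|>" + organism)
    if gene:
        parts.append("<|GENE|>" + gene)
    if before:
        parts.append("<|FLANK_BEFORE|>" + as_tokens(before))
    if after:
        parts.append("<|FLANK_AFTER|>" + as_tokens(after))
    return "".join(parts)


# Class ids as stated in the model card.
ID2LABEL = {0: "EXON", 1: "INTRON"}

prompt = build_prompt(
    "GTAA", organism="Homo sapiens", gene="HLA-B", before="CCGA", after="AGCC"
)
print(prompt)
```

Calling `build_prompt("GTAA")` with no optional fields yields only the `<|SEQUENCE|>` part, which matches the case of classifying a raw nucleotide sequence.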

---

## Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, published as the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).

---

## Publications

- **Full Paper**  
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.  
  DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575).
- **Short Paper**  
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.  
  DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113).

---

## Training

- Trained on a system with 8x H100 GPUs.

---

## Metrics

**Average accuracy:** **0.9996**

| Class      | Precision | Recall | F1-Score |
| ---------- | --------- | ------ | -------- |
| **Intron** | 0.9994    | 0.9994 | 0.9994   |
| **Exon**   | 0.9997    | 0.9997 | 0.9997   |

---

### Notes

- Metrics were computed on the full, isolated test set.
- The classes are only mildly imbalanced (approximately two exons per intron), so the scores can be interpreted directly.
- The model can operate on raw nucleotide sequences without additional biological features (e.g. organism, gene, before or after).
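The reported figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes F1 from the tabulated precision and recall, and estimates overall accuracy as the class-weighted recall. It assumes that the 2:1 exon-to-intron ratio holds exactly in the test set and that "average accuracy" means overall accuracy; neither is stated explicitly above.

```python
# Values copied from the metrics table above.
precision = {"intron": 0.9994, "exon": 0.9997}
recall = {"intron": 0.9994, "exon": 0.9997}


def f1_score(p, r):
    # F1 is the harmonic mean of precision and recall.
    return 2 * p * r / (p + r)


f1 = {label: f1_score(precision[label], recall[label]) for label in precision}

# Assumption: exactly two exons per intron in the test set.
class_weight = {"exon": 2, "intron": 1}
total = sum(class_weight.values())

# Overall accuracy equals recall averaged over samples, i.e. weighted by class size.
accuracy = sum(class_weight[c] * recall[c] for c in recall) / total

print({label: round(value, 4) for label, value in f1.items()})
print(round(accuracy, 4))
```

Under these assumptions the weighted recall is (2 × 0.9997 + 0.9994) / 3 = 0.9996, which agrees with the reported average accuracy.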

---

## GitHub Repository

The full code for **data processing, model training, and inference** is available on GitHub:  
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

You can find scripts for:

- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models