Nucleotide Classifier

A fine-tuned BERT model for classifying nucleotides as introns or exons, trained on a large cross-species GenBank dataset covering 34,627 species.


Architecture

  • Base model: BERT-base-uncased
  • Approach: Nucleotide classification
  • Framework: PyTorch + Hugging Face Transformers

Usage

You can use this model through its own custom pipeline:

from transformers import pipeline

pipe = pipeline(
    task="bert-nucleotide-classification",
    model="GustavoHCruz/NuclBERT",
    trust_remote_code=True,
)

out = pipe(
    {
        "before": "ATGATCCAGTTAAAAAATATATTC",
        "sequence": "C",
        "after": "",
        "organism": "Rotaria socialis",
    }
)

print(out) # EXON

out = pipe(
    {
        "before": "GTAACATTAAAATAAAAAACAAAA",
        "sequence": "T",
        "after": "ATTATTTAAAGAAAAATATAATTA",
        "organism": "Rotaria sp. Silwood1",
    }
)

print(out) # INTRON

The model's maximum context is the same as BERT's (512 tokens), but training was limited to 24 nucleotides before and after the target nucleotide. Likewise, the additional organism context was truncated to 10 characters during training.

When using the pipeline, these limits are enforced automatically: the sequence is restricted to a single nucleotide, the before and after flanks are truncated to a maximum of 24 nucleotides each, and the organism name is limited to 10 characters, even if longer inputs are provided.
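As a rough sketch, the truncation rules above can be expressed as follows. Note that `truncate_inputs` is a hypothetical helper, not the pipeline's actual preprocessing code; in particular, which end of each flank is kept is an assumption (here, the 24 nucleotides nearest the target).

```python
MAX_FLANK = 24     # nucleotides kept before/after the target
MAX_ORGANISM = 10  # characters kept from the organism name

def truncate_inputs(example: dict) -> dict:
    """Apply the model card's input limits to a raw example (sketch)."""
    return {
        "sequence": example["sequence"][:1],       # single target nucleotide
        "before": example["before"][-MAX_FLANK:],  # assume: keep the 24 nt adjacent to the target
        "after": example["after"][:MAX_FLANK],     # assume: keep the 24 nt adjacent to the target
        "organism": example.get("organism", "")[:MAX_ORGANISM],
    }

print(truncate_inputs({
    "before": "A" * 30,
    "sequence": "CT",
    "after": "G" * 30,
    "organism": "Rotaria sp. Silwood1",
}))
```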


Custom Usage Information

Prompt format:

The model expects the following input format:

<|SEQUENCE|>[DNA_T]<|FLANK_BEFORE|>[DNA_T][DNA_C][DNA_G]...<|FLANK_AFTER|>[DNA_A][DNA_G][DNA_C]...<|ORGANISM|>Homo sapiens...<|TARGET|>
  • <|SEQUENCE|>: One single nucleotide.
  • <|FLANK_BEFORE|> and <|FLANK_AFTER|>: Upstream/downstream context sequences. Use a maximum of 24 nucleotides.
  • <|ORGANISM|>: Optional organism name (truncated to a maximum of 10 characters in training).
  • <|TARGET|>: Separator token.

The model predicts one of three class labels: 0 (Exon), 1 (Intron), or 2 (Unknown).
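A minimal sketch of assembling this prompt from raw fields (`build_prompt` is a hypothetical helper; it assumes each nucleotide maps to a `[DNA_X]` token, as the template above implies):

```python
def build_prompt(sequence: str, before: str, after: str, organism: str = "") -> str:
    """Assemble the model's expected input string from raw fields (sketch)."""
    to_tokens = lambda seq: "".join(f"[DNA_{nt}]" for nt in seq.upper())
    return (
        f"<|SEQUENCE|>{to_tokens(sequence)}"
        f"<|FLANK_BEFORE|>{to_tokens(before)}"
        f"<|FLANK_AFTER|>{to_tokens(after)}"
        f"<|ORGANISM|>{organism}"
        f"<|TARGET|>"
    )

print(build_prompt("C", "ATG", "GGA", organism="Homo sapie"))
# <|SEQUENCE|>[DNA_C]<|FLANK_BEFORE|>[DNA_A][DNA_T][DNA_G]<|FLANK_AFTER|>[DNA_G][DNA_G][DNA_A]<|ORGANISM|>Homo sapie<|TARGET|>
```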


Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available at the DNA Coding Regions Dataset.


Training

  • Trained on 8× NVIDIA H100 GPUs.

Metrics

Average accuracy: 0.8183

Class     Precision  Recall  F1-Score
Intron    0.6455     0.7164  0.6791
Exon      0.8236     0.9214  0.8698
Unknown   0.8522     0.6974  0.7671
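The per-class F1 scores are consistent with the standard definition F1 = 2PR / (P + R); a quick check against the precision and recall columns above:

```python
# Reported per-class (precision, recall, f1) values from the table above.
reported = {
    "Intron":  (0.6455, 0.7164, 0.6791),
    "Exon":    (0.8236, 0.9214, 0.8698),
    "Unknown": (0.8522, 0.6974, 0.7671),
}

for cls, (p, r, f1) in reported.items():
    computed = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    print(f"{cls}: computed F1 = {computed:.4f} (reported {f1})")
```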

Notes

  • Metrics were computed on a fully held-out test set.
  • The model can operate without the additional organism context.

GitHub Repository

The full code for data processing, model training, and inference is available on GitHub:
CodingDNATransformers

You can find scripts for:

  • Preprocessing GenBank sequences
  • Fine-tuning models
  • Evaluating and using the trained models