---
license: mit
base_model:
- google-bert/bert-base-uncased
tags:
- genomics
- bioinformatics
- DNA
- sequence-classification
- introns
- exons
- BERT
---

# Exons and Introns Classifier

A BERT model fine-tuned to **classify DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset covering 34,627 species.

---

## Architecture

- Base model: BERT-base-uncased
- Approach: Full-sequence classification
- Framework: PyTorch + Hugging Face Transformers

---

## Usage

You can use this model through its own custom pipeline:

```python
from transformers import pipeline

pipe = pipeline(
    task="bert-exon-intron-classification",
    model="GustavoHCruz/ExInBERT",
    trust_remote_code=True,
)

out = pipe(
    {
        "sequence": "GTAAGGAGGGGGATGAGGGGTCATATCTCTTCTCAGGGAAAGCAGGAGCCCTTCAGCAGGGTCAGGGCCCCTCATCTTCCCCTCCTTTCCCAG",
        "organism": "Homo sapiens",
        "gene": "HLA-B",
        "before": "CCGAAGCCCCTCAGCCTGAGATGGG",
        "after": "AGCCATCTTCCCAGTCCACCGTCCC",
    }
)
print(out)  # INTRON
```

This model uses the same maximum context length as standard BERT (512 tokens), but it was trained on DNA sequences of up to 256 nucleotides. The additional context fields (`organism`, `gene`, `before`, `after`) were also subject to specific rules during training:

- Organism and gene names were truncated to 10 characters.
- The flanking sequences `before` and `after` were limited to 25 nucleotides each.

The pipeline enforces the same rules: the nucleotide sequence, `organism`, `gene`, `before`, and `after` are automatically truncated if they exceed these limits.

---

## Custom Usage Information

Prompt format: the model expects the following input format:

```
<|SEQUENCE|>[G][T][A][A]...<|ORGANISM|>Homo sapiens<|GENE|>HLA-B<|FLANK_BEFORE|>[C][C][G][A]...<|FLANK_AFTER|>[A][G][C][C]...
```

- `<|SEQUENCE|>`: Full DNA sequence. Maximum of 256 nucleotides.
- `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters during training).
- `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters during training).
- `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences. Maximum of 25 nucleotides each.

The model predicts the class label: 0 (Exon) or 1 (Intron).

---

## Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available as the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).

---

## Publications

- **Full Paper**
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC) and held in Fortaleza, Ceará, Brazil.
  DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575)
- **Short Paper**
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
  DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)

---

## Training

- Trained on 8x H100 GPUs.

---

## Metrics

**Average accuracy:** **0.9996**

| Class      | Precision | Recall | F1-Score |
| ---------- | --------- | ------ | -------- |
| **Intron** | 0.9994    | 0.9994 | 0.9994   |
| **Exon**   | 0.9997    | 0.9997 | 0.9997   |

---

### Notes

- Metrics were computed on a fully isolated test set.
- The classes follow a ratio of approximately two exons to one intron, allowing direct interpretation of the scores.
- The model can operate on raw nucleotide sequences without the additional biological features (e.g., organism, gene, before, or after).

---

## GitHub Repository

The full code for **data processing, model training, and inference** is available on GitHub: [CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

You can find scripts for:

- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models
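As an illustration of the prompt format and truncation rules described under "Custom Usage Information", here is a minimal sketch of how such a prompt could be assembled. The `build_prompt` helper and its bracket-per-nucleotide formatting are assumptions inferred from the example format shown above, not the pipeline's actual preprocessing code (see the GitHub repository for that).

```python
# Hypothetical sketch: assemble the special-token prompt, applying the
# training-time limits (256-nt sequence, 10-char names, 25-nt flanks).


def build_prompt(sequence, organism=None, gene=None, before=None, after=None):
    def tokenize_nt(seq, limit):
        # Truncate, then wrap each nucleotide in brackets: "GTA" -> "[G][T][A]"
        return "".join(f"[{nt}]" for nt in seq[:limit])

    prompt = "<|SEQUENCE|>" + tokenize_nt(sequence, 256)
    if organism:
        prompt += "<|ORGANISM|>" + organism[:10]  # names capped at 10 chars
    if gene:
        prompt += "<|GENE|>" + gene[:10]
    if before:
        prompt += "<|FLANK_BEFORE|>" + tokenize_nt(before, 25)  # 25-nt flanks
    if after:
        prompt += "<|FLANK_AFTER|>" + tokenize_nt(after, 25)
    return prompt


print(build_prompt("GTAAG", organism="Homo sapiens", gene="HLA-B",
                   before="CCGA", after="AGCC"))
# <|SEQUENCE|>[G][T][A][A][G]<|ORGANISM|>Homo sapie<|GENE|>HLA-B<|FLANK_BEFORE|>[C][C][G][A]<|FLANK_AFTER|>[A][G][C][C]
```

Note how `"Homo sapiens"` is cut to its first 10 characters, matching the training-time truncation; the custom pipeline applies these limits for you, so this is only needed if you tokenize inputs manually.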