---
license: mit
base_model:
- openai-community/gpt2
tags:
- genomics
- bioinformatics
- DNA
- sequence-classification
- introns
- exons
- GPT
---

# Exons and Introns Classifier

A GPT-2 model fine-tuned to **classify DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset (34,627 different species).

---

## Model Architecture

- Base model: GPT-2
- Approach: Full-sequence classification

---

## Usage

You can use this model through its own custom pipeline:

```python
from transformers import pipeline

# Load the custom pipeline shipped with the model repository
# (trust_remote_code=True is required because the pipeline code lives there).
pipe = pipeline(
    task="gpt2-exon-intron-classification",
    model="GustavoHCruz/ExInGPT",
    trust_remote_code=True,
)

# Only "sequence" is required; the other fields are optional context.
out = pipe(
    {
        "sequence": "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG",
        "organism": "Homo sapiens",
        "gene": "HLA-C",
        "before": "GGTCTTTTTTTTTGTTCTACCCCAG",
        "after": "GTGAGATTCTGGGGAGCTGAAGTGG",
    }
)

print(out)  # EXON
```

This model uses the same maximum context length as standard GPT-2 (1024 tokens), but it was trained on DNA sequences of up to 512 nucleotides. The additional context fields (`organism`, `gene`, `before`, `after`) were also subject to specific limits during training:

- Organism and gene names were truncated to 10 characters.
- Flanking sequences `before` and `after` were up to 25 nucleotides long.

The pipeline enforces these limits: the nucleotide sequence and the `organism`, `gene`, `before`, and `after` fields are truncated automatically if they exceed them.

---

## Custom Usage Information

The model expects input in the following prompt format:

```
<|SEQUENCE|>[G][C][A][G]...
<|ORGANISM|>Homo sapiens
<|GENE|>HLA-C
<|FLANK_BEFORE|>[G][G][T][C]...
<|FLANK_AFTER|>[G][T][G][A]...
<|TARGET|>
```

- `<|SEQUENCE|>`: Full DNA sequence. Maximum of 512 nucleotides.
- `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training).
- `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training).
- `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences.
  Maximum of 25 nucleotides.
- `<|TARGET|>`: Separation token for label prediction.

The model should predict the next token as the class label: `[EXON]` or `[INTRON]`.

---

## Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available at the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).

---

## Publications

- **Full Paper**

  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.

  DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575).

- **Short Paper**

  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.

  DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113).

---

## Training

- Trained on a system with 8x H100 GPUs.

---

## Metrics

**Average accuracy:** **0.9985**

| Class      | Precision | Recall | F1-Score |
| ---------- | --------- | ------ | -------- |
| **Intron** | 0.9977    | 0.9973 | 0.9975   |
| **Exon**   | 0.9988    | 0.9990 | 0.9989   |

---

### Notes

- Metrics were computed on the full, isolated test set.
- The classes follow a ratio of approximately two exons to one intron, allowing for direct interpretation of the scores.
- The model can operate on raw nucleotide sequences without additional biological features (e.g. `organism`, `gene`, `before`, or `after`).

---

## GitHub Repository

The full code for **data processing, model training, and inference** is available on GitHub: [CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

You can find scripts for:

- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models
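
## Prompt Construction Sketch

For anyone calling the model without the custom pipeline, the prompt format and truncation limits listed under "Custom Usage Information" can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the helper names `bracket` and `build_prompt`, the `sep` parameter, and the choice to drop the tag of any missing optional field are all hypothetical. In particular, this card does not say how fields are separated in the real input string, so check the pipeline source in the repository before relying on the exact output.

```python
MAX_SEQUENCE = 512  # nucleotides, limit stated in this card
MAX_FLANK = 25      # nucleotides per flanking sequence
MAX_NAME = 10       # characters for organism and gene names


def bracket(nucleotides: str, limit: int) -> str:
    """Truncate to `limit` characters and wrap each nucleotide as [X]."""
    return "".join(f"[{base}]" for base in nucleotides[:limit])


def build_prompt(sequence, organism=None, gene=None,
                 before=None, after=None, sep=""):
    """Hypothetical helper that assembles the prompt string.

    Assumptions (not confirmed by the card): optional fields are skipped
    entirely when absent, and `sep` joins the fields (empty by default).
    """
    parts = [f"<|SEQUENCE|>{bracket(sequence, MAX_SEQUENCE)}"]
    if organism:
        parts.append(f"<|ORGANISM|>{organism[:MAX_NAME]}")
    if gene:
        parts.append(f"<|GENE|>{gene[:MAX_NAME]}")
    if before:
        parts.append(f"<|FLANK_BEFORE|>{bracket(before, MAX_FLANK)}")
    if after:
        parts.append(f"<|FLANK_AFTER|>{bracket(after, MAX_FLANK)}")
    # The model is expected to emit [EXON] or [INTRON] right after this token.
    parts.append("<|TARGET|>")
    return sep.join(parts)


prompt = build_prompt(
    sequence="GCAGCAACAG",
    organism="Homo sapiens",
    gene="HLA-C",
    before="GGTC",
    after="GTGA",
    sep="\n",  # one field per line, mirroring how the format is displayed
)
print(prompt)
```

Keeping the limits as named constants makes it easy to see where each rule from the card is applied; the resulting string would then be tokenized with the model's own tokenizer, which is outside the scope of this sketch.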