---
license: mit
base_model:
- google-bert/bert-base-uncased
tags:
- genomics
- bioinformatics
- DNA
- sequence-classification
- introns
- exons
- BERT
---

# Exons and Introns Classifier

A fine-tuned BERT model for **classifying DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset covering 34,627 species.

---

## Architecture

- Base model: BERT-base-uncased (`google-bert/bert-base-uncased`)
- Approach: Full-sequence classification
- Framework: PyTorch + Hugging Face Transformers
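
As a quick sanity check of the base architecture, here is a minimal sketch (assuming the checkpoint's config loads with the standard `AutoConfig` API):

```python
from transformers import AutoConfig

# Assumption: the checkpoint's config is readable with AutoConfig.
config = AutoConfig.from_pretrained("GustavoHCruz/ExInBERT", trust_remote_code=True)
print(config.model_type)               # expected: "bert"
print(config.max_position_embeddings)  # expected: 512
```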

---

## Usage

You can use this model through its custom pipeline:

```python
from transformers import pipeline

pipe = pipeline(
    task="bert-exon-intron-classification",
    model="GustavoHCruz/ExInBERT",
    trust_remote_code=True,
)

out = pipe(
    {
        "sequence": "GTAAGGAGGGGGATGAGGGGTCATATCTCTTCTCAGGGAAAGCAGGAGCCCTTCAGCAGGGTCAGGGCCCCTCATCTTCCCCTCCTTTCCCAG",
        "organism": "Homo sapiens",
        "gene": "HLA-B",
        "before": "CCGAAGCCCCTCAGCCTGAGATGGG",
        "after": "AGCCATCTTCCCAGTCCACCGTCCC",
    }
)

print(out)  # INTRON
```

This model keeps the standard BERT maximum context length (512 tokens), but it was trained on DNA sequences of up to 256 nucleotides. The additional context fields (`organism`, `gene`, `before`, `after`) were preprocessed with specific rules during training:

- Organism and gene names were truncated to 10 characters.
- Flanking sequences (`before` and `after`) were truncated to 25 nucleotides.

The pipeline enforces these rules automatically: the `sequence`, `organism`, `gene`, `before`, and `after` fields are truncated whenever they exceed their limits, as sketched below.
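
As an illustration only (not the pipeline's actual code; `LIMITS` and `truncate_inputs` are names invented here), the truncation amounts to:

```python
# Stated per-field limits: nucleotides for sequences, characters for names.
LIMITS = {"sequence": 256, "organism": 10, "gene": 10, "before": 25, "after": 25}

def truncate_inputs(example: dict) -> dict:
    """Clip each known field to its training-time limit."""
    return {k: (v[: LIMITS[k]] if k in LIMITS else v) for k, v in example.items()}

print(truncate_inputs({"sequence": "A" * 300, "gene": "verylonggenename"}))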

---

## Custom Usage Information

If you bypass the pipeline, the model expects the following prompt format:

```
<|SEQUENCE|>[G][T][A][A]...<|ORGANISM|>Homo sapiens<|GENE|>HLA-B<|FLANK_BEFORE|>[C][C][G][A]...<|FLANK_AFTER|>[A][G][C][C]...
```

- `<|SEQUENCE|>`: Full DNA sequence. Maximum of 256 nucleotides.
- `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training).
- `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training).
- `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences. Maximum of 25 nucleotides.

The model predicts a class label: 0 (exon) or 1 (intron).
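
Putting this together, here is a minimal sketch of manual inference (assuming the checkpoint loads through `AutoModelForSequenceClassification` and that the special tokens above are registered in the tokenizer; `build_prompt` is a helper invented here):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "GustavoHCruz/ExInBERT"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, trust_remote_code=True)

def build_prompt(sequence, organism="", gene="", before="", after=""):
    """Assemble the documented prompt, applying the stated truncation limits."""
    def wrap(s):  # one [X] token per nucleotide
        return "".join(f"[{nt}]" for nt in s)
    return (
        f"<|SEQUENCE|>{wrap(sequence[:256])}"
        f"<|ORGANISM|>{organism[:10]}"
        f"<|GENE|>{gene[:10]}"
        f"<|FLANK_BEFORE|>{wrap(before[:25])}"
        f"<|FLANK_AFTER|>{wrap(after[:25])}"
    )

prompt = build_prompt("GTAAGGAGGGGGATGAGGGG", organism="Homo sapiens", gene="HLA-B")
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # 0 = exon, 1 = intron
```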

---

## Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available at the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
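
For example, the dataset can be loaded with the `datasets` library (the split and field names below are assumptions; inspect the printed structure first):

```python
from datasets import load_dataset

ds = load_dataset("GustavoHCruz/DNA_coding_regions")
print(ds)              # available splits and sizes
print(ds["train"][0])  # assumption: a "train" split exists; inspect one record
```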

---

## Publications

- **Full Paper**: Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC) and held in Fortaleza, Ceará, Brazil.
  DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575)
- **Short Paper**: Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
  DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)

---

## Training

- Trained on 8× NVIDIA H100 GPUs.

---

## Metrics

**Average accuracy:** **0.9996**

| Class      | Precision | Recall | F1-Score |
| ---------- | --------- | ------ | -------- |
| **Intron** | 0.9994    | 0.9994 | 0.9994   |
| **Exon**   | 0.9997    | 0.9997 | 0.9997   |

---

### Notes

- Metrics were computed on a fully held-out test set.
- The classes follow a ratio of approximately two exons to one intron, so the per-class scores can be interpreted directly.
- The model can operate on raw nucleotide sequences alone, without the additional context features (`organism`, `gene`, `before`, `after`); see the sketch below.
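
For example, a sequence-only call (assuming the custom pipeline accepts a dict containing just the `sequence` key, with the optional fields omitted):

```python
from transformers import pipeline

pipe = pipeline(
    task="bert-exon-intron-classification",
    model="GustavoHCruz/ExInBERT",
    trust_remote_code=True,
)

# Assumption: the optional context fields (organism, gene, before, after)
# can simply be left out of the input dict.
out = pipe({"sequence": "GTAAGGAGGGGGATGAGGGGTCATATCTCTTCTCAGGG"})
print(out)
```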

---

## GitHub Repository

The full code for **data processing, model training, and inference** is available on GitHub:
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

You can find scripts for:

- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models