GustavoHCruz committed on
Commit 10827f7 · verified · 1 Parent(s): 531590b

Upload folder using huggingface_hub

Files changed (3):
  1. README.md +88 -33
  2. bert_exon_intron_classification.py +169 -0
  3. config.json +7 -3
README.md CHANGED
@@ -1,77 +1,132 @@
  ---
  license: mit
  base_model:
- - google-bert/bert-base-uncased
  tags:
- - genomics
- - bioinformatics
- - DNA
- - sequence-classification
- - introns
- - exons
- - BERT
  ---

  # Exons and Introns Classifier

- BERT finetuned model for **classifying DNA sequences** into **introns** and **exons**, trained on a large cross-species GenBank dataset.

  ## Architecture
- - Base model: BERT
  - Approach: Full-sequence classification
  - Framework: PyTorch + Hugging Face Transformers
-
  ## Usage

- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
-
- tokenizer = AutoTokenizer.from_pretrained("GustavoHCruz/ExInBERT")
- model = AutoModelForSequenceClassification.from_pretrained("GustavoHCruz/ExInBERT")
  ```

  Prompt format:

  The model expects the following input format:

  ```
- <|SEQUENCE|>ACGAAGGGTAAGCC...
- <|ORGANISM|>...
- <|GENE|>...
- <|FLANK_BEFORE|>ACGT...
- <|FLANK_AFTER|>ACGT...
  ```

- - `<|SEQUENCE|>`: Full DNA sequence.
  - `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training).
  - `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training).
- - `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences.

  The model predicts a class label: 0 (Exon) or 1 (Intron).

- ## Data

- The model was trained on a processed version of GenBank sequences spanning multiple species.

  ## Publications

- - **Full Paper – 2nd Place (National)**
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
- [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575)
- - **Short Paper (International)**
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
- [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)
-
  ## Training

  - Trained on 8x H100 GPUs.

  ## GitHub Repository

  The full code for **data processing, model training, and inference** is available on GitHub:
  [CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

- You can find scripts for:
- - Preprocessing GenBank sequences
- - Fine-tuning models
- - Evaluating and using the trained models
  ---
  license: mit
  base_model:
+ - google-bert/bert-base-uncased
  tags:
+ - genomics
+ - bioinformatics
+ - DNA
+ - sequence-classification
+ - introns
+ - exons
+ - BERT
  ---

  # Exons and Introns Classifier

+ A BERT model fine-tuned to **classify DNA sequences** as **exons** or **introns**, trained on a large cross-species GenBank dataset (34,627 different species).
+
+ ---

  ## Architecture
+
+ - Base model: BERT-base-uncased
  - Approach: Full-sequence classification
  - Framework: PyTorch + Hugging Face Transformers
+
+ ---
+
  ## Usage

+ You can use this model through its own custom pipeline:

+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline(
+     task="bert-exon-intron-classification",
+     model="GustavoHCruz/ExInBERT",
+     trust_remote_code=True,
+ )
+
+ out = pipe(
+     {
+         "sequence": "GTAAGGAGGGGGATGAGGGGTCATATCTCTTCTCAGGGAAAGCAGGAGCCCTTCAGCAGGGTCAGGGCCCCTCATCTTCCCCTCCTTTCCCAG",
+         "organism": "Homo sapiens",
+         "gene": "HLA-B",
+         "before": "CCGAAGCCCCTCAGCCTGAGATGGG",
+         "after": "AGCCATCTTCCCAGTCCACCGTCCC",
+     }
+ )
+
+ print(out)  # INTRON
  ```

+ This model has the same maximum context length as standard BERT (512 tokens), but it was trained on DNA sequences of up to 256 nucleotides. The additional context fields (`organism`, `gene`, `before`, `after`) followed specific rules during training:
+
+ - Organism and gene names were truncated to 10 characters.
+ - Flanking sequences (`before` and `after`) were truncated to 25 nucleotides.
+
+ The pipeline enforces the same rules: the nucleotide sequence, `organism`, `gene`, `before`, and `after` are automatically truncated if they exceed these limits.
+
+ ---
+
+ ## Custom Usage Information
+
  Prompt format:

  The model expects the following input format:

  ```
+ <|SEQUENCE|>GCAG...<|ORGANISM|>Homo sapiens<|GENE|>HLA-C<|FLANK_BEFORE|>GGTC...<|FLANK_AFTER|>GTGA...
  ```

+ - `<|SEQUENCE|>`: Full DNA sequence. Maximum of 256 nucleotides.
  - `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training).
  - `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training).
+ - `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences. Maximum of 25 nucleotides.

  The model predicts a class label: 0 (Exon) or 1 (Intron).

+ ---
+
+ ## Dataset

+ The model was trained on a processed version of GenBank sequences spanning multiple species, available as the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
+
+ ---

  ## Publications

+ - **Full Paper**
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
+ DOI: [10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575)
+ - **Short Paper**
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
+ DOI: [10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)
+
+ ---
+
  ## Training

  - Trained on 8x H100 GPUs.

+ ---
+
+ ## Metrics
+
+ **Average accuracy:** **0.9996**
+
+ | Class      | Precision | Recall | F1-Score |
+ | ---------- | --------- | ------ | -------- |
+ | **Intron** | 0.9994    | 0.9994 | 0.9994   |
+ | **Exon**   | 0.9997    | 0.9997 | 0.9997   |
+
+ ### Notes
+
+ - Metrics were computed on the full test set.
+ - The classes follow a ratio of approximately 2 exons to 1 intron, allowing direct interpretation of the scores.
+ - The model can operate on raw nucleotide sequences without the additional biological features (`organism`, `gene`, `before`, `after`).
+
+ ---
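> Editor's note: with the stated ~2:1 exon-to-intron split, a majority-class baseline reaches only about 0.67 accuracy, which puts the scores above in context. A back-of-the-envelope check (the class counts below are hypothetical, chosen only to match the stated ratio):

```python
# Hypothetical test-set counts at the stated ~2:1 exon:intron ratio.
exons, introns = 2000, 1000
total = exons + introns

# A trivial classifier that always predicts "exon".
majority_accuracy = exons / total
print(round(majority_accuracy, 3))  # → 0.667

# The reported average accuracy, for comparison.
model_accuracy = 0.9996
print(model_accuracy > majority_accuracy)  # → True
```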
+
  ## GitHub Repository

  The full code for **data processing, model training, and inference** is available on GitHub:
  [CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

+ You can find scripts for:
+
+ - Preprocessing GenBank sequences
+ - Fine-tuning models
+ - Evaluating and using the trained models
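> Editor's note: the truncation rules in the README above can be sketched as a plain string builder. This is a minimal illustration, not the repository's exact code (`build_prompt` is a hypothetical helper; the shipped pipeline additionally wraps each nucleotide as `[A]`, `[C]`, … before tokenizing):

```python
# Limits documented in the README: 256 nt sequence, 10-char names, 25 nt flanks.
SEQ_MAX, NAME_MAX, FLANK_MAX = 256, 10, 25

def build_prompt(sequence, organism=None, gene=None, before=None, after=None):
    # Each field is truncated to the limit used during training,
    # then appended after its special token.
    prompt = f"<|SEQUENCE|>{sequence[:SEQ_MAX]}"
    if organism:
        prompt += f"<|ORGANISM|>{organism[:NAME_MAX]}"
    if gene:
        prompt += f"<|GENE|>{gene[:NAME_MAX]}"
    if before:
        prompt += f"<|FLANK_BEFORE|>{before[:FLANK_MAX]}"
    if after:
        prompt += f"<|FLANK_AFTER|>{after[:FLANK_MAX]}"
    return prompt

p = build_prompt("ACGT" * 100, organism="Homo sapiens", gene="HLA-B")
print(p[:40])  # sequence is cut to 256 nt, organism to "Homo sapie"
```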
bert_exon_intron_classification.py ADDED
@@ -0,0 +1,169 @@
+ from typing import Any, Optional
+
+ import torch
+ from transformers import BertForSequenceClassification, Pipeline
+ from transformers.pipelines import PIPELINE_REGISTRY
+ from transformers.utils.generic import ModelOutput
+
+ # Map each IUPAC nucleotide code to a bracketed single token.
+ DNA_MAP = {
+     "A": "[A]",
+     "C": "[C]",
+     "G": "[G]",
+     "T": "[T]",
+     "R": "[R]",
+     "Y": "[Y]",
+     "S": "[S]",
+     "W": "[W]",
+     "K": "[K]",
+     "M": "[M]",
+     "B": "[B]",
+     "D": "[D]",
+     "H": "[H]",
+     "V": "[V]",
+     "N": "[N]",
+ }
+
+ def process_sequence(seq: str) -> str:
+     seq = seq.strip().upper()
+     return "".join(DNA_MAP.get(ch, "[N]") for ch in seq)
+
+ def process_label(p: int) -> str:
+     return "EXON" if p == 0 else "INTRON"
+
+ def ensure_optional_str(value: Any) -> Optional[str]:
+     return value if isinstance(value, str) else None
+
+ class BERTExonIntronClassificationPipeline(Pipeline):
+     def _build_prompt(
+         self,
+         sequence: str,
+         organism: Optional[str],
+         gene: Optional[str],
+         before: Optional[str],
+         after: Optional[str],
+     ) -> str:
+         out = f"<|SEQUENCE|>{process_sequence(sequence[:256])}"
+
+         if organism:
+             out += f"<|ORGANISM|>{organism[:10]}"
+
+         if gene:
+             out += f"<|GENE|>{gene[:10]}"
+
+         if before:
+             before_p = process_sequence(before[:25])
+             out += f"<|FLANK_BEFORE|>{before_p}"
+
+         if after:
+             after_p = process_sequence(after[:25])
+             out += f"<|FLANK_AFTER|>{after_p}"
+
+         return out
+
+     def _sanitize_parameters(self, **kwargs):
+         preprocess_kwargs = {}
+
+         for k in ("organism", "gene", "before", "after", "max_length"):
+             if k in kwargs:
+                 preprocess_kwargs[k] = kwargs[k]
+
+         forward_kwargs = {
+             k: v for k, v in kwargs.items()
+             if k not in preprocess_kwargs
+         }
+
+         postprocess_kwargs = {}
+
+         return preprocess_kwargs, forward_kwargs, postprocess_kwargs
+
+     def preprocess(self, input_, **preprocess_parameters):
+         assert self.tokenizer
+
+         if isinstance(input_, str):
+             sequence = input_
+         elif isinstance(input_, dict):
+             sequence = input_.get("sequence", "")
+         else:
+             raise TypeError("input_ must be str or dict with 'sequence' key")
+
+         # Keyword arguments take precedence; fall back to the input dict.
+         organism_raw = preprocess_parameters.get("organism", None)
+         gene_raw = preprocess_parameters.get("gene", None)
+         before_raw = preprocess_parameters.get("before", None)
+         after_raw = preprocess_parameters.get("after", None)
+
+         if isinstance(input_, dict):
+             if organism_raw is None:
+                 organism_raw = input_.get("organism", None)
+             if gene_raw is None:
+                 gene_raw = input_.get("gene", None)
+             if before_raw is None:
+                 before_raw = input_.get("before", None)
+             if after_raw is None:
+                 after_raw = input_.get("after", None)
+
+         organism: Optional[str] = ensure_optional_str(organism_raw)
+         gene: Optional[str] = ensure_optional_str(gene_raw)
+         before: Optional[str] = ensure_optional_str(before_raw)
+         after: Optional[str] = ensure_optional_str(after_raw)
+
+         max_length = preprocess_parameters.get("max_length", 256)
+         if not isinstance(max_length, int):
+             raise TypeError("max_length must be an int")
+
+         prompt = self._build_prompt(sequence, organism, gene, before, after)
+
+         enc = self.tokenizer(
+             prompt,
+             return_tensors="pt",
+             max_length=max_length,
+             truncation=True,
+         ).to(self.model.device)
+
+         return {"prompt": prompt, "inputs": enc}
+
+     def _forward(self, input_tensors: dict, **forward_params):
+         assert isinstance(self.model, BertForSequenceClassification)
+
+         inputs = input_tensors.get("inputs")
+         if inputs is None:
+             raise ValueError("Model inputs missing in input_tensors (expected key 'inputs').")
+
+         # Normalize the encoding to a dict of tensors on the model device.
+         if hasattr(inputs, "items") and not isinstance(inputs, torch.Tensor):
+             expanded_inputs = {
+                 k: v.to(self.model.device) if isinstance(v, torch.Tensor) else v
+                 for k, v in dict(inputs).items()
+             }
+         elif isinstance(inputs, torch.Tensor):
+             expanded_inputs = {"input_ids": inputs.to(self.model.device)}
+         else:
+             expanded_inputs = {"input_ids": torch.tensor(inputs, device=self.model.device)}
+
+         self.model.eval()
+         with torch.no_grad():
+             outputs = self.model(**expanded_inputs, **forward_params)
+
+         pred_id = torch.argmax(outputs.logits, dim=-1).item()
+
+         return ModelOutput({"pred_id": pred_id})
+
+     def postprocess(self, model_outputs: dict, **kwargs):
+         pred_id = model_outputs["pred_id"]
+         return process_label(pred_id)
+
+ PIPELINE_REGISTRY.register_pipeline(
+     "bert-exon-intron-classification",
+     pipeline_class=BERTExonIntronClassificationPipeline,
+     pt_model=BertForSequenceClassification,
+ )
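> Editor's note: the `DNA_MAP` encoding above wraps every IUPAC nucleotide code in brackets so the tokenizer treats it as a single special token, with anything unrecognized falling back to `[N]`. The helper is pure string logic and can be exercised standalone (restated here for illustration):

```python
# Standalone restatement of the file's process_sequence helper.
DNA_MAP = {ch: f"[{ch}]" for ch in "ACGTRYSWKMBDHVN"}

def process_sequence(seq: str) -> str:
    # Normalize whitespace/case, then map each character; unknowns become [N].
    seq = seq.strip().upper()
    return "".join(DNA_MAP.get(ch, "[N]") for ch in seq)

print(process_sequence("acgtX"))  # → [A][C][G][T][N]
```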
config.json CHANGED
@@ -1,9 +1,13 @@
  {
-   "architectures": [
-     "BertForSequenceClassification"
-   ],
    "attention_probs_dropout_prob": 0.1,
    "classifier_dropout": null,
    "dtype": "float32",
    "gradient_checkpointing": false,
    "hidden_act": "gelu",

  {
+   "architectures": ["BertForSequenceClassification"],
    "attention_probs_dropout_prob": 0.1,
    "classifier_dropout": null,
+   "custom_pipelines": {
+     "bert-exon-intron-classification": {
+       "impl": "bert_exon_intron_classification.BERTExonIntronClassificationPipeline",
+       "pt": ["BertForSequenceClassification"]
+     }
+   },
    "dtype": "float32",
    "gradient_checkpointing": false,
    "hidden_act": "gelu",
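> Editor's note: for `trust_remote_code=True` loading to work, the key under `custom_pipelines` must match both the task name that `bert_exon_intron_classification.py` registers via `PIPELINE_REGISTRY.register_pipeline` and the `task=` argument passed to `pipeline()`, i.e. `bert-exon-intron-classification`. A minimal consistency check over that fragment:

```python
import json

# Fragment mirroring the custom_pipelines entry, keyed by the task name
# that the pipeline module registers.
config = json.loads("""
{
  "custom_pipelines": {
    "bert-exon-intron-classification": {
      "impl": "bert_exon_intron_classification.BERTExonIntronClassificationPipeline",
      "pt": ["BertForSequenceClassification"]
    }
  }
}
""")

# Task name used in PIPELINE_REGISTRY.register_pipeline and in pipeline(task=...).
registered_task = "bert-exon-intron-classification"
assert registered_task in config["custom_pipelines"]
print("custom_pipelines key matches the registered task")
```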