GustavoHCruz committed on
Commit 10827f7 · verified · 1 Parent(s): 531590b

Upload folder using huggingface_hub

Files changed (3):
  1. README.md +88 -33
  2. bert_exon_intron_classification.py +169 -0
  3. config.json +7 -3
README.md CHANGED
@@ -1,77 +1,132 @@
  ---
  license: mit
  base_model:
- - google-bert/bert-base-uncased
  tags:
- - genomics
- - bioinformatics
- - DNA
- - sequence-classification
- - introns
- - exons
- - BERT
  ---

  # Exons and Introns Classifier

- BERT finetuned model for **classifying DNA sequences** into **introns** and **exons**, trained on a large cross-species GenBank dataset.

  ## Architecture
- - Base model: BERT
  - Approach: Full-sequence classification
  - Framework: PyTorch + Hugging Face Transformers
-
  ## Usage

- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
-
- tokenizer = AutoTokenizer.from_pretrained("GustavoHCruz/ExInBERT")
- model = AutoModelForSequenceClassification.from_pretrained("GustavoHCruz/ExInBERT")
  ```

  Prompt format:

  The model expects the following input format:

  ```
- <|SEQUENCE|>ACGAAGGGTAAGCC...
- <|ORGANISM|>...
- <|GENE|>...
- <|FLANK_BEFORE|>ACGT...
- <|FLANK_AFTER|>ACGT...
  ```

- - `<|SEQUENCE|>`: Full DNA sequence.
  - `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training).
  - `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training).
- - `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences.

  The model predicts a class label: 0 (Exon) or 1 (Intron).

- ## Data

- The model was trained on a processed version of GenBank sequences spanning multiple species.

  ## Publications

- - **Full Paper – 2nd Place (National)**
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
- [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575)
- - **Short Paper (International)**
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
- [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)
-
  ## Training

  - Trained on 8x H100 GPUs.

  ## GitHub Repository

  The full code for **data processing, model training, and inference** is available on GitHub:
  [CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

- You can find scripts for:
- - Preprocessing GenBank sequences
- - Fine-tuning models
- - Evaluating and using the trained models
  ---
  license: mit
  base_model:
+ - google-bert/bert-base-uncased
  tags:
+ - genomics
+ - bioinformatics
+ - DNA
+ - sequence-classification
+ - introns
+ - exons
+ - BERT
  ---

  # Exons and Introns Classifier

+ A BERT model fine-tuned to **classify DNA sequences** as **exons** or **introns**, trained on a large cross-species GenBank dataset (34,627 different species).
+
+ ---

  ## Architecture
+
+ - Base model: BERT-base-uncased
  - Approach: Full-sequence classification
  - Framework: PyTorch + Hugging Face Transformers
+
+ ---
+
  ## Usage

+ You can use this model through its own custom pipeline:

+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline(
+     task="bert-exon-intron-classification",
+     model="GustavoHCruz/ExInBERT",
+     trust_remote_code=True,
+ )
+
+ out = pipe(
+     {
+         "sequence": "GTAAGGAGGGGGATGAGGGGTCATATCTCTTCTCAGGGAAAGCAGGAGCCCTTCAGCAGGGTCAGGGCCCCTCATCTTCCCCTCCTTTCCCAG",
+         "organism": "Homo sapiens",
+         "gene": "HLA-B",
+         "before": "CCGAAGCCCCTCAGCCTGAGATGGG",
+         "after": "AGCCATCTTCCCAGTCCACCGTCCC",
+     }
+ )
+
+ print(out)  # INTRON
  ```

+ This model has the same maximum context length as standard BERT (512 tokens), but it was trained on DNA sequences of up to 256 nucleotides. The additional context fields (`organism`, `gene`, `before`, `after`) followed specific rules during training:
+
+ - Organism and gene names were truncated to 10 characters.
+ - Flanking sequences (`before` and `after`) were truncated to 25 nucleotides.
+
+ The pipeline enforces the same rules: the nucleotide sequence, `organism`, `gene`, `before`, and `after` are automatically truncated if they exceed these limits.
+
+ ---
+
+ ## Custom Usage Information
+
  Prompt format:

  The model expects the following input format:

  ```
+ <|SEQUENCE|>GCAG...<|ORGANISM|>Homo sapiens<|GENE|>HLA-C<|FLANK_BEFORE|>GGTC...<|FLANK_AFTER|>GTGA...
  ```

+ - `<|SEQUENCE|>`: Full DNA sequence. Maximum of 256 nucleotides.
  - `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training).
  - `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training).
+ - `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences. Maximum of 25 nucleotides.

  The model predicts a class label: 0 (Exon) or 1 (Intron).

+ ---
+
+ ## Dataset

+ The model was trained on a processed version of GenBank sequences spanning multiple species, available as the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
+
+ ---

  ## Publications

+ - **Full Paper**
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
+ DOI: [10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575)
+ - **Short Paper**
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
+ DOI: [10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)
+
+ ---
+
  ## Training

  - Trained on 8x H100 GPUs.

+ ---
+
+ ## Metrics
+
+ **Average accuracy:** **0.9996**
+
+ | Class      | Precision | Recall | F1-Score |
+ | ---------- | --------- | ------ | -------- |
+ | **Intron** | 0.9994    | 0.9994 | 0.9994   |
+ | **Exon**   | 0.9997    | 0.9997 | 0.9997   |
+
+ ### Notes
+
+ - Metrics were computed on the full test set.
+ - The classes follow a ratio of approximately 2 exons to 1 intron, allowing direct interpretation of the scores.
+ - The model can operate on raw nucleotide sequences without the additional biological features (`organism`, `gene`, `before`, `after`).
+
+ ---
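> Editor's note: with the stated ~2:1 exon-to-intron split, a majority-class baseline reaches only about 0.67 accuracy, which puts the scores above in context. A back-of-the-envelope check (the class counts below are hypothetical, chosen only to match the stated ratio):

```python
# Hypothetical test-set counts at the stated ~2:1 exon:intron ratio.
exons, introns = 2000, 1000
total = exons + introns

# A trivial classifier that always predicts "exon".
majority_accuracy = exons / total
print(round(majority_accuracy, 3))  # → 0.667

# The reported average accuracy, for comparison.
model_accuracy = 0.9996
print(model_accuracy > majority_accuracy)  # → True
```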
+
  ## GitHub Repository

  The full code for **data processing, model training, and inference** is available on GitHub:
  [CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

+ You can find scripts for:
+
+ - Preprocessing GenBank sequences
+ - Fine-tuning models
+ - Evaluating and using the trained models
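> Editor's note: the truncation rules in the README above can be sketched as a plain string builder. This is a minimal illustration, not the repository's exact code (`build_prompt` is a hypothetical helper; the shipped pipeline additionally wraps each nucleotide as `[A]`, `[C]`, … before tokenizing):

```python
# Limits documented in the README: 256 nt sequence, 10-char names, 25 nt flanks.
SEQ_MAX, NAME_MAX, FLANK_MAX = 256, 10, 25

def build_prompt(sequence, organism=None, gene=None, before=None, after=None):
    # Each field is truncated to the limit used during training,
    # then appended after its special token.
    prompt = f"<|SEQUENCE|>{sequence[:SEQ_MAX]}"
    if organism:
        prompt += f"<|ORGANISM|>{organism[:NAME_MAX]}"
    if gene:
        prompt += f"<|GENE|>{gene[:NAME_MAX]}"
    if before:
        prompt += f"<|FLANK_BEFORE|>{before[:FLANK_MAX]}"
    if after:
        prompt += f"<|FLANK_AFTER|>{after[:FLANK_MAX]}"
    return prompt

p = build_prompt("ACGT" * 100, organism="Homo sapiens", gene="HLA-B")
print(p[:40])  # sequence is cut to 256 nt, organism to "Homo sapie"
```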
bert_exon_intron_classification.py ADDED
@@ -0,0 +1,169 @@
+ from typing import Any, Optional
+
+ import torch
+ from transformers import BertForSequenceClassification, Pipeline
+ from transformers.pipelines import PIPELINE_REGISTRY
+ from transformers.utils.generic import ModelOutput
+
+ # Map each IUPAC nucleotide code to a bracketed single token.
+ DNA_MAP = {
+     "A": "[A]",
+     "C": "[C]",
+     "G": "[G]",
+     "T": "[T]",
+     "R": "[R]",
+     "Y": "[Y]",
+     "S": "[S]",
+     "W": "[W]",
+     "K": "[K]",
+     "M": "[M]",
+     "B": "[B]",
+     "D": "[D]",
+     "H": "[H]",
+     "V": "[V]",
+     "N": "[N]",
+ }
+
+ def process_sequence(seq: str) -> str:
+     seq = seq.strip().upper()
+     return "".join(DNA_MAP.get(ch, "[N]") for ch in seq)
+
+ def process_label(p: int) -> str:
+     return "EXON" if p == 0 else "INTRON"
+
+ def ensure_optional_str(value: Any) -> Optional[str]:
+     return value if isinstance(value, str) else None
+
+ class BERTExonIntronClassificationPipeline(Pipeline):
+     def _build_prompt(
+         self,
+         sequence: str,
+         organism: Optional[str],
+         gene: Optional[str],
+         before: Optional[str],
+         after: Optional[str],
+     ) -> str:
+         out = f"<|SEQUENCE|>{process_sequence(sequence[:256])}"
+
+         if organism:
+             out += f"<|ORGANISM|>{organism[:10]}"
+
+         if gene:
+             out += f"<|GENE|>{gene[:10]}"
+
+         if before:
+             before_p = process_sequence(before[:25])
+             out += f"<|FLANK_BEFORE|>{before_p}"
+
+         if after:
+             after_p = process_sequence(after[:25])
+             out += f"<|FLANK_AFTER|>{after_p}"
+
+         return out
+
+     def _sanitize_parameters(self, **kwargs):
+         preprocess_kwargs = {}
+
+         for k in ("organism", "gene", "before", "after", "max_length"):
+             if k in kwargs:
+                 preprocess_kwargs[k] = kwargs[k]
+
+         forward_kwargs = {
+             k: v for k, v in kwargs.items()
+             if k not in preprocess_kwargs
+         }
+
+         postprocess_kwargs = {}
+
+         return preprocess_kwargs, forward_kwargs, postprocess_kwargs
+
+     def preprocess(self, input_, **preprocess_parameters):
+         assert self.tokenizer
+
+         if isinstance(input_, str):
+             sequence = input_
+         elif isinstance(input_, dict):
+             sequence = input_.get("sequence", "")
+         else:
+             raise TypeError("input_ must be str or dict with 'sequence' key")
+
+         # Keyword arguments take precedence; fall back to the input dict.
+         organism_raw = preprocess_parameters.get("organism", None)
+         gene_raw = preprocess_parameters.get("gene", None)
+         before_raw = preprocess_parameters.get("before", None)
+         after_raw = preprocess_parameters.get("after", None)
+
+         if isinstance(input_, dict):
+             if organism_raw is None:
+                 organism_raw = input_.get("organism", None)
+             if gene_raw is None:
+                 gene_raw = input_.get("gene", None)
+             if before_raw is None:
+                 before_raw = input_.get("before", None)
+             if after_raw is None:
+                 after_raw = input_.get("after", None)
+
+         organism: Optional[str] = ensure_optional_str(organism_raw)
+         gene: Optional[str] = ensure_optional_str(gene_raw)
+         before: Optional[str] = ensure_optional_str(before_raw)
+         after: Optional[str] = ensure_optional_str(after_raw)
+
+         max_length = preprocess_parameters.get("max_length", 256)
+         if not isinstance(max_length, int):
+             raise TypeError("max_length must be an int")
+
+         prompt = self._build_prompt(sequence, organism, gene, before, after)
+
+         enc = self.tokenizer(
+             prompt,
+             return_tensors="pt",
+             max_length=max_length,
+             truncation=True,
+         ).to(self.model.device)
+
+         return {"prompt": prompt, "inputs": enc}
+
+     def _forward(self, input_tensors: dict, **forward_params):
+         assert isinstance(self.model, BertForSequenceClassification)
+
+         inputs = input_tensors.get("inputs")
+         if inputs is None:
+             raise ValueError("Model inputs missing in input_tensors (expected key 'inputs').")
+
+         # Normalize the encoding to a dict of tensors on the model device.
+         if hasattr(inputs, "items") and not isinstance(inputs, torch.Tensor):
+             expanded_inputs = {
+                 k: v.to(self.model.device) if isinstance(v, torch.Tensor) else v
+                 for k, v in dict(inputs).items()
+             }
+         elif isinstance(inputs, torch.Tensor):
+             expanded_inputs = {"input_ids": inputs.to(self.model.device)}
+         else:
+             expanded_inputs = {"input_ids": torch.tensor(inputs, device=self.model.device)}
+
+         self.model.eval()
+         with torch.no_grad():
+             outputs = self.model(**expanded_inputs, **forward_params)
+
+         pred_id = torch.argmax(outputs.logits, dim=-1).item()
+
+         return ModelOutput({"pred_id": pred_id})
+
+     def postprocess(self, model_outputs: dict, **kwargs):
+         pred_id = model_outputs["pred_id"]
+         return process_label(pred_id)
+
+ PIPELINE_REGISTRY.register_pipeline(
+     "bert-exon-intron-classification",
+     pipeline_class=BERTExonIntronClassificationPipeline,
+     pt_model=BertForSequenceClassification,
+ )
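> Editor's note: the `DNA_MAP` encoding above wraps every IUPAC nucleotide code in brackets so the tokenizer treats it as a single special token, with anything unrecognized falling back to `[N]`. The helper is pure string logic and can be exercised standalone (restated here for illustration):

```python
# Standalone restatement of the file's process_sequence helper.
DNA_MAP = {ch: f"[{ch}]" for ch in "ACGTRYSWKMBDHVN"}

def process_sequence(seq: str) -> str:
    # Normalize whitespace/case, then map each character; unknowns become [N].
    seq = seq.strip().upper()
    return "".join(DNA_MAP.get(ch, "[N]") for ch in seq)

print(process_sequence("acgtX"))  # → [A][C][G][T][N]
```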
config.json CHANGED
@@ -1,9 +1,13 @@
  {
-   "architectures": [
-     "BertForSequenceClassification"
-   ],
    "attention_probs_dropout_prob": 0.1,
    "classifier_dropout": null,
    "dtype": "float32",
    "gradient_checkpointing": false,
    "hidden_act": "gelu",

  {
+   "architectures": ["BertForSequenceClassification"],
    "attention_probs_dropout_prob": 0.1,
    "classifier_dropout": null,
+   "custom_pipelines": {
+     "bert-exon-intron-classification": {
+       "impl": "bert_exon_intron_classification.BERTExonIntronClassificationPipeline",
+       "pt": ["BertForSequenceClassification"]
+     }
+   },
    "dtype": "float32",
    "gradient_checkpointing": false,
    "hidden_act": "gelu",
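> Editor's note: for `trust_remote_code=True` loading to work, the key under `custom_pipelines` must match both the task name that `bert_exon_intron_classification.py` registers via `PIPELINE_REGISTRY.register_pipeline` and the `task=` argument passed to `pipeline()`, i.e. `bert-exon-intron-classification`. A minimal consistency check over that fragment:

```python
import json

# Fragment mirroring the custom_pipelines entry, keyed by the task name
# that the pipeline module registers.
config = json.loads("""
{
  "custom_pipelines": {
    "bert-exon-intron-classification": {
      "impl": "bert_exon_intron_classification.BERTExonIntronClassificationPipeline",
      "pt": ["BertForSequenceClassification"]
    }
  }
}
""")

# Task name used in PIPELINE_REGISTRY.register_pipeline and in pipeline(task=...).
registered_task = "bert-exon-intron-classification"
assert registered_task in config["custom_pipelines"]
print("custom_pipelines key matches the registered task")
```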