---
base_model: unsloth/Qwen3-1.7B
library_name: peft
license: mit
datasets:
- moremilk/ToT-Biology
language:
- en
pipeline_tag: text-generation
tags:
- sft
- trl
- unsloth
- transformers
- biology
- science
metrics:
- accuracy
---

# Model Card for BioGenesis-ToT

## Model Details

### Model Description

BioGenesis-ToT is a fine-tuned version of Qwen3-1.7B, optimized for mechanistic reasoning and explanatory understanding in biology.
The model was trained on the [moremilk/ToT-Biology](https://huggingface.co/datasets/moremilk/ToT-Biology) dataset, a reasoning-rich collection of biology questions that emphasizes why and how processes occur rather than simply what happens.

The model demonstrates strong capabilities in:
- Structured biological explanation generation
- Logical and causal reasoning
- Chain-of-thought (ToT) reasoning in scientific contexts
- Interdisciplinary biological analysis (e.g., bioengineering, medicine, ecology)


## Uses

### 🚀 Intended Use

- Educational and scientific explanation generation
- Biological reasoning and tutoring applications
- Model interpretability research
- Generating training data for reasoning-focused LLMs


### ⚠️ Limitations

- Not a replacement for expert biological judgment
- May occasionally over-generalize or simplify complex phenomena
- Reasoning quality is tuned for biological contexts; the model was not trained for creative writing or coding


## Evaluation

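The table below compares BioGenesis-ToT with the base Qwen3-1.7B on the [emre/TARA_Turkish_LLM_Benchmark](https://huggingface.co/datasets/emre/TARA_Turkish_LLM_Benchmark). The higher score in each row is in bold.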


| Category                                                 | BioGenesis-ToT | Qwen3-1.7B |
| -------------------------------------------------------- | -------------- | ---------- |   
| Scientific Explanation and Hypothesis Evaluation (RAG)   |   **66.36**    |   61.82    |
| Ethical Dilemma Assessment                               |   **55.45**    |   47.27    |
| Complex Scenario Analysis and Drawing Conclusions        |   **61.82**    |   59.09    |
| Constrained Creative Writing                             |   **18.18**    |   9.09     |
| Logical Inference (Text-Based)                           |   49.09        | **68.18**  |
| Mathematical Reasoning                                   |   **42.73**    |   37.27    |
| Planning and Optimization Problems (Text-Based)          |   **52.73**    |   25.45    |
| Python Code Analysis and Debugging                       |   **51.82**    |   50.00    |
| Generating SQL Query (From Schema/Meta)                  |   **39.09**    |   36.36    |
| Cause-Effect Relationship in Historical Events (RAG)     |   **77.27**    |   73.64    |
| **Overall**                                              |   **51.45**    |   46.82    |
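
As a sanity check on the table, the short sketch below recomputes the Overall row and the per-category differences. It assumes the Overall score is the unweighted mean of the ten categories, which is an inference from the numbers rather than something the benchmark states here; the variable names are illustrative only.

```python
# Scores copied from the table above as (BioGenesis-ToT, Qwen3-1.7B).
scores = {
    "Scientific Explanation and Hypothesis Evaluation (RAG)": (66.36, 61.82),
    "Ethical Dilemma Assessment": (55.45, 47.27),
    "Complex Scenario Analysis and Drawing Conclusions": (61.82, 59.09),
    "Constrained Creative Writing": (18.18, 9.09),
    "Logical Inference (Text-Based)": (49.09, 68.18),
    "Mathematical Reasoning": (42.73, 37.27),
    "Planning and Optimization Problems (Text-Based)": (52.73, 25.45),
    "Python Code Analysis and Debugging": (51.82, 50.00),
    "Generating SQL Query (From Schema/Meta)": (39.09, 36.36),
    "Cause-Effect Relationship in Historical Events (RAG)": (77.27, 73.64),
}


def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Assumption: Overall = unweighted mean across categories.
fine_tuned_overall = round(mean(ft for ft, _ in scores.values()), 2)
base_overall = round(mean(base for _, base in scores.values()), 2)

# Positive delta means the fine-tuned model scored higher than the base model.
deltas = {name: round(ft - base, 2) for name, (ft, base) in scores.items()}
regressions = [name for name, delta in deltas.items() if delta < 0]

print("Overall:", fine_tuned_overall, "vs", base_overall)
print("Categories where the base model scores higher:", regressions)
```

Under that assumption the recomputed means match the Overall row, and Logical Inference (Text-Based) is the only category where the base model scores higher.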
 

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
from peft import PeftModel

# Load the base model and tokenizer, then attach the fine-tuned adapter.
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-1.7B")
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3-1.7B",
    device_map={"": 0},  # place the whole model on GPU 0
)
model = PeftModel.from_pretrained(base_model, "khazarai/BioGenesis-ToT")

question = """
Describe the composition of the plasma membrane and explain how its structure relates to its function of selective permeability.
"""

messages = [
    {"role": "user", "content": question}
]
# Build the chat prompt; enable_thinking=True lets the model emit its reasoning first.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

# Tokenize on the model's device and stream the completion as it is generated.
inputs = tokenizer(text, return_tensors="pt").to(model.device)
_ = model.generate(
    **inputs,
    max_new_tokens=2200,
    do_sample=True,  # sampling must be on for temperature/top_p/top_k to apply
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)
```

**Using the `pipeline` API:**
```python
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer, then attach the fine-tuned adapter.
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-1.7B")
base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3-1.7B")
model = PeftModel.from_pretrained(base_model, "khazarai/BioGenesis-ToT")

question = """
Describe the composition of the plasma membrane and explain how its structure relates to its function of selective permeability.
"""

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [
    {"role": "user", "content": question}
]
# Keep the result so it is visible when run as a script, not only in a notebook.
outputs = pipe(messages)
print(outputs[0]["generated_text"])
```
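
With thinking enabled, Qwen3-style models typically wrap their reasoning in `<think>...</think>` tags before the final answer. The helper below is a minimal sketch for separating the two once the completion has been decoded to a string; `split_thinking` is a hypothetical name, not part of `transformers` or `peft`, and the sample text is made up for illustration.

```python
def split_thinking(generated: str) -> tuple[str, str]:
    """Split decoded model output into (reasoning, answer).

    Assumes the reasoning, if present, is closed by a </think> tag.
    If no closing tag is found, the whole text is treated as the answer.
    """
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = generated.partition(close_tag)
    if not sep:
        return "", generated.strip()
    # Drop the opening tag (if any) from the reasoning part.
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


# Illustrative input only; real completions will be much longer.
sample = (
    "<think>\nPhospholipids are amphipathic.\n</think>\n\n"
    "The membrane is a bilayer."
)
reasoning, answer = split_thinking(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Splitting on the closing tag rather than matching both tags keeps the helper usable when the prompt template has already emitted the opening `<think>` and only the remainder appears in the completion.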


## 🧪 Dataset: moremilk/ToT-Biology

The ToT-Biology dataset emphasizes mechanistic understanding and explanatory reasoning within biology.
It's designed to help AI models develop interpretable, step-by-step reasoning abilities for complex biological systems.

It spans a wide range of biological subdomains:
- Foundational biology: Cell biology, genetics, evolution, and ecology
- Advanced topics: Systems biology, synthetic biology, computational biophysics
- Applied domains: Medicine, agriculture, bioengineering, and environmental science

Dataset features include:

- 🧩 Logical reasoning styles: deductive, inductive, abductive, causal, and analogical
- 🧠 Problem-solving techniques: decomposition, elimination, systems thinking, trade-off analysis
- 🔬 Real-world problem contexts: experiment design, pathway mapping, and data interpretation
- 🌍 Practical relevance: bridging theoretical reasoning and applied biological insight
- 🎓 Educational focus: for both AI training and human learning in scientific reasoning


## 🧭 Objective

This fine-tuning project aims to build an interpretable reasoning model capable of:

- Explaining biological mechanisms clearly and coherently
- Demonstrating transparent, step-by-step thought processes
- Applying logical reasoning techniques to biological and interdisciplinary problems
- Supporting educational and research use cases where reasoning transparency matters


## Citation

**BibTeX:**
```bibtex
@misc{khazarai/BioGenesis-ToT,
  title     = {BioGenesis-ToT: A Fine-Tuned Model for Explanatory Biological Reasoning},
  author    = {Rustam Shiriyev},
  year      = {2025},
  publisher = {Hugging Face},
  base_model = {Qwen3-1.7B},
  dataset   = {moremilk/ToT-Biology},
  license   = {MIT}
}

```

### Framework versions

- PEFT 0.15.2