	---
	base_model: unsloth/Qwen3-1.7B
	library_name: peft
	license: mit
	datasets:
	- moremilk/ToT-Biology
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- sft
	- trl
	- unsloth
	- transformers
	- biology
	- science
	metrics:
	- accuracy
	---

	# Model Card for BioGenesis-ToT

	## Model Details

	### Model Description

	BioGenesis-ToT is a PEFT adapter fine-tuned from Qwen3-1.7B (`unsloth/Qwen3-1.7B`) for mechanistic reasoning and explanatory understanding in biology.
	It was trained on the [moremilk/ToT-Biology](https://huggingface.co/datasets/moremilk/ToT-Biology) dataset, a reasoning-rich collection of biology questions that emphasize why and how processes occur rather than simply what happens.

	The model is tuned for:
	- Structured biological explanation generation
	- Logical and causal reasoning
	- Step-by-step (ToT-style) reasoning in scientific contexts
	- Interdisciplinary biological analysis (e.g., bioengineering, medicine, ecology)


	## Uses

	### 🚀 Intended Use

	- Educational and scientific explanation generation
	- Biological reasoning and tutoring applications
	- Model interpretability research
	- Generating training data for reasoning-focused LLMs


	### ⚠️ Limitations

	- Not a replacement for expert biological judgment
	- May occasionally over-generalize or simplify complex phenomena
	- Fine-tuned only for reasoning in biological contexts; not trained for creative writing or coding


	## Evaluation

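	The table below compares BioGenesis-ToT with the base Qwen3-1.7B on the [emre/TARA_Turkish_LLM_Benchmark](https://huggingface.co/datasets/emre/TARA_Turkish_LLM_Benchmark), by category. BioGenesis-ToT scores higher in every category except Logical Inference (Text-Based).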


	| Category | BioGenesis-ToT | Qwen3-1.7B |
	| -------------------------------------------------------- | -------------- | ---------- |
	| Scientific Explanation and Hypothesis Evaluation (RAG) | 66.36 | 61.82 |
	| Ethical Dilemma Assessment | 55.45 | 47.27 |
	| Complex Scenario Analysis and Drawing Conclusions | 61.82 | 59.09 |
	| Constrained Creative Writing | 18.18 | 9.09 |
	| Logical Inference (Text-Based) | 49.09 | 68.18 |
	| Mathematical Reasoning | 42.73 | 37.27 |
	| Planning and Optimization Problems (Text-Based) | 52.73 | 25.45 |
	| Python Code Analysis and Debugging | 51.82 | 50.00 |
	| Generating SQL Query (From Schema/Meta) | 39.09 | 36.36 |
	| Cause-Effect Relationship in Historical Events (RAG) | 77.27 | 73.64 |
	| Overall | 51.45 | 46.82 |
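
The card does not say how the Overall row is computed. As a quick sanity check, the sketch below assumes it is the unweighted mean of the ten category scores and recomputes it from the numbers in the table above, along with the per-category difference between the two models. The `scores` dictionary and the `summarize` helper are illustrative names, not part of the benchmark tooling.

```python
# Scores copied from the evaluation table: (BioGenesis-ToT, Qwen3-1.7B).
scores = {
    "Scientific Explanation and Hypothesis Evaluation (RAG)": (66.36, 61.82),
    "Ethical Dilemma Assessment": (55.45, 47.27),
    "Complex Scenario Analysis and Drawing Conclusions": (61.82, 59.09),
    "Constrained Creative Writing": (18.18, 9.09),
    "Logical Inference (Text-Based)": (49.09, 68.18),
    "Mathematical Reasoning": (42.73, 37.27),
    "Planning and Optimization Problems (Text-Based)": (52.73, 25.45),
    "Python Code Analysis and Debugging": (51.82, 50.00),
    "Generating SQL Query (From Schema/Meta)": (39.09, 36.36),
    "Cause-Effect Relationship in Historical Events (RAG)": (77.27, 73.64),
}


def summarize(scores):
    """Return (tuned mean, base mean, per-category deltas), rounded to 2 places.

    Assumption: the Overall row is an unweighted mean over categories.
    """
    n = len(scores)
    tuned_mean = sum(tuned for tuned, _ in scores.values()) / n
    base_mean = sum(base for _, base in scores.values()) / n
    deltas = {cat: round(tuned - base, 2) for cat, (tuned, base) in scores.items()}
    return round(tuned_mean, 2), round(base_mean, 2), deltas


tuned_mean, base_mean, deltas = summarize(scores)
print(f"Overall: BioGenesis-ToT {tuned_mean}, Qwen3-1.7B {base_mean}")

# List categories from largest gain to largest regression.
for cat, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{delta:+7.2f}  {cat}")
```

Under that assumption the recomputed means match the Overall row. The largest gain is in Planning and Optimization Problems (Text-Based), and the only regression is in Logical Inference (Text-Based).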


	## How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
	from peft import PeftModel

	# Load the base model and tokenizer, then attach the BioGenesis-ToT adapter.
	tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-1.7B")
	base_model = AutoModelForCausalLM.from_pretrained(
	    "unsloth/Qwen3-1.7B",
	    device_map={"": 0},  # place the whole model on GPU 0
	)
	model = PeftModel.from_pretrained(base_model, "khazarai/BioGenesis-ToT")

	question = """
	Describe the composition of the plasma membrane and explain how its structure relates to its function of selective permeability.
	"""

	messages = [
	    {"role": "user", "content": question}
	]

	# Render the chat prompt as plain text; enable_thinking turns on Qwen3's reasoning mode.
	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	    enable_thinking=True,
	)

	# Generate and stream the new tokens to stdout as they are produced.
	_ = model.generate(
	    **tokenizer(text, return_tensors="pt").to("cuda"),
	    max_new_tokens=2200,
	    do_sample=True,  # sampling must be on for temperature/top_p/top_k to apply
	    temperature=0.6,
	    top_p=0.95,
	    top_k=20,
	    streamer=TextStreamer(tokenizer, skip_prompt=True),
	)
	```
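
With `enable_thinking = True`, the generated text usually contains a reasoning segment followed by the final answer. The sketch below shows one way to separate the two after decoding. It assumes the reasoning is wrapped in `<think>` and `</think>` tags, as in Qwen3's chat format. `split_thinking` is a hypothetical helper, and `sample` is a hand-written string, not real model output.

```python
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_thinking(decoded: str):
    """Split decoded output into (reasoning, answer).

    Assumption: reasoning, when present, ends with CLOSE_TAG and may or may
    not start with OPEN_TAG.
    """
    head, sep, tail = decoded.partition(CLOSE_TAG)
    if not sep:
        stripped = decoded.strip()
        if stripped.startswith(OPEN_TAG):
            # Reasoning started but was cut off (e.g. by max_new_tokens).
            return stripped[len(OPEN_TAG):].strip(), ""
        # No reasoning segment at all: everything is the answer.
        return "", stripped
    reasoning = head.replace(OPEN_TAG, "", 1).strip()
    return reasoning, tail.strip()


# Hand-written placeholder text standing in for decoded model output.
sample = (
    "<think>\nThe membrane is a lipid bilayer with embedded proteins.\n</think>\n\n"
    "The plasma membrane is a phospholipid bilayer."
)
reasoning, answer = split_thinking(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

To use it with the snippet above, capture the return value of `model.generate` instead of discarding it, and decode the newly generated tokens with `tokenizer.decode` before passing the string to the helper.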

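	Alternatively, use the `pipeline` API:
	```python
	from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	# Load the base model and tokenizer, then attach the BioGenesis-ToT adapter.
	tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-1.7B")
	base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3-1.7B")
	model = PeftModel.from_pretrained(base_model, "khazarai/BioGenesis-ToT")

	question = """
	Describe the composition of the plasma membrane and explain how its structure relates to its function of selective permeability.
	"""

	pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
	messages = [
	    {"role": "user", "content": question}
	]
	# Allow enough new tokens for the reasoning and the answer, matching the example above.
	outputs = pipe(messages, max_new_tokens=2200)

	# For chat-style input, generated_text holds the conversation; the last message is the reply.
	print(outputs[0]["generated_text"][-1]["content"])
	```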


	## 🧪 Dataset: moremilk/ToT-Biology

	The ToT-Biology dataset emphasizes mechanistic understanding and explanatory reasoning within biology.
	It’s designed to help AI models develop interpretable, step-by-step reasoning abilities for complex biological systems.

	It spans a wide range of biological subdomains:
	- Foundational biology: Cell biology, genetics, evolution, and ecology
	- Advanced topics: Systems biology, synthetic biology, computational biophysics
	- Applied domains: Medicine, agriculture, bioengineering, and environmental science

	Dataset features include:

	- 🧩 Logical reasoning styles — deductive, inductive, abductive, causal, and analogical
	- 🧠 Problem-solving techniques — decomposition, elimination, systems thinking, trade-off analysis
	- 🔬 Real-world problem contexts — experiment design, pathway mapping, and data interpretation
	- 🌍 Practical relevance — bridging theoretical reasoning and applied biological insight
	- 🎓 Educational focus — for both AI training and human learning in scientific reasoning
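
This card does not document how dataset records were formatted for fine-tuning. As a rough illustration only, the sketch below converts one question-and-answer record into the conversational `messages` layout commonly used for chat-style supervised fine-tuning. The field names `question` and `answer`, the `to_chat_example` helper, and the placeholder answer text are all assumptions; check the dataset's actual columns before adapting it.

```python
def to_chat_example(record, question_key="question", answer_key="answer"):
    """Convert one Q&A record into a chat-format training example.

    Assumption: the record is a dict with one question field and one answer
    field; the key names here are hypothetical defaults.
    """
    return {
        "messages": [
            {"role": "user", "content": record[question_key].strip()},
            {"role": "assistant", "content": record[answer_key].strip()},
        ]
    }


# Placeholder record: the question is the one used in this card's usage
# example; the answer is a stand-in, not real dataset content.
record = {
    "question": (
        "Describe the composition of the plasma membrane and explain how its "
        "structure relates to its function of selective permeability.\n"
    ),
    "answer": "<reference explanation goes here>",
}

example = to_chat_example(record)
for message in example["messages"]:
    print(f"{message['role']}: {message['content']}")
```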


	## 🧭 Objective

	This fine-tuning project aims to build an interpretable reasoning model capable of:

	- Explaining biological mechanisms clearly and coherently
	- Demonstrating transparent, step-by-step thought processes
	- Applying logical reasoning techniques to biological and interdisciplinary problems
	- Supporting educational and research use cases where reasoning transparency matters


	## Citation

	BibTeX:
	```bibtex
	@misc{khazarai/BioGenesis-ToT,
	  title = {BioGenesis-ToT: A Fine-Tuned Model for Explanatory Biological Reasoning},
	  author = {Rustam Shiriyev},
	  year = {2025},
	  publisher = {Hugging Face},
	  base_model = {Qwen3-1.7B},
	  dataset = {moremilk/ToT-Biology},
	  license = {MIT}
	}
	```

	### Framework versions

	- PEFT 0.15.2