---
language:
- ko
- en
tags:
- chemistry
- biology
- toxicology
license: apache-2.0
base_model: Qwen/Qwen3-14B
---
# Blowfish
## Introduction
**Blowfish** is a large language model (LLM) developed to perform **molecular toxicity prediction**.
It is fine-tuned from **Qwen3-14B** and goes beyond simple binary classification: using **Chain-of-Thought (CoT)** reasoning, it logically explains the chemical and biological rationale behind each toxicity verdict.
The model jointly analyzes the user-provided **SMILES**, **Cell Line**, **Bio Assay**, and key **RDKit features**, and concludes with a binary verdict (`toxic` / `nontoxic`).
### Key Features
* **Base Model:** Qwen3-14B
* **Task:** Binary toxicity prediction and molecular structure analysis
* **Language:** Korean (system prompt), English (chemical reasoning and answers)
* **Input Data:**
  - SMILES code
  - Cell line / cell type
  - Bio assay name
  - RDKit features (top and bottom 3 features ranked by SHAP value)
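
The SHAP-ranked features are passed to the model as plain text. Below is a minimal sketch of how the feature summary could be assembled from precomputed per-feature SHAP values; the feature names and values are hypothetical, and this helper is not part of the released code:

```python
# Hypothetical per-feature SHAP values for one molecule: positive values push
# the prediction toward "toxic", negative values toward "nontoxic".
shap_values = {
    "MolLogP": 0.42,
    "TPSA": -0.31,
    "NumAromaticRings": 0.27,
    "NumHDonors": -0.18,
    "MolWt": 0.09,
    "NumRotatableBonds": -0.05,
}

# Rank features by SHAP value, then take the top 3 toxicity-driving features
# (largest positive) and the top 3 non-toxicity-driving features (most negative).
ranked = sorted(shap_values.items(), key=lambda kv: kv[1], reverse=True)
top_toxic = [name for name, v in ranked[:3] if v > 0]
top_nontoxic = [name for name, v in reversed(ranked[-3:]) if v < 0]

feature_nl = (
    f"Top toxic features: {', '.join(top_toxic)} / "
    f"Top non-toxic features: {', '.join(top_nontoxic)}"
)
print(feature_nl)
```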
---
## Prompt Format
To get the best performance from the model, follow the prompt format used during training.
### System Prompt
> "You are a cheminformatics/toxicology expert specialized in molecular toxicity prediction. The user provides the 3 features that most strongly drive toxicity and the 3 that most strongly drive non-toxicity... (truncated) ... Do not use tool calls."

Note that the model expects this system prompt in Korean; the full Korean text appears in the inference example below.
### μ‚¬μš©μž μž…λ ₯ ν…œν”Œλ¦Ώ (User Input Template)
```
SMILES: {smiles_code}
Cell Line: {cell_line}
Bio Assay Name: {endpoint_category}
Feature NL: {feature_NL_description}
Feature Descript: {feature_detailed_description}
{cot_instruction}
```
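
The template can be filled with a small helper. `build_user_prompt` and its argument names are illustrative, not part of the released code:

```python
def build_user_prompt(smiles_code, cell_line, endpoint_category,
                      feature_nl, feature_descript, cot_instruction):
    """Fill the user-input template in the exact field order used during training."""
    return (
        f"SMILES: {smiles_code}\n"
        f"Cell Line: {cell_line}\n"
        f"Bio Assay Name: {endpoint_category}\n"
        f"Feature NL: {feature_nl}\n"
        f"Feature Descript: {feature_descript}\n\n"
        f"{cot_instruction}"
    )

# Example with placeholder values.
prompt = build_user_prompt(
    "CCO", "HepG2 (Liver)", "AhR",
    "Top toxic features: ...", "Detailed feature descriptions",
    "Determine whether the compound is toxic or non-toxic.",
)
print(prompt)
```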
---
# Inference
## Requirements
```bash
pip install transformers torch accelerate
```
## Usage with transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# 1. Load the model and tokenizer
model_id = "TeamUNIVA/Blowfish"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# 2. Define the system prompt (kept in Korean, matching the training format)
system_prompt = (
"당신은 λΆ„μž 독성 μ˜ˆμΈ‘μ— νŠΉν™”λœ 화학정보학/독성학 μ „λ¬Έκ°€μž…λ‹ˆλ‹€.\n"
"μ‚¬μš©μžλŠ” 독성/비독성에 영ν–₯을 많이 λΌμΉ˜λŠ” Feature 3κ°œμ”©μ„ μ œκ³΅ν•©λ‹ˆλ‹€.\n\n"
"μž…λ ₯(μ‚¬μš©μžκ°€ 제곡):\n"
"- SMILES\n- Cell Type\n- Cell Line\n- Bio Assay Name\n"
"- 독성에 λΌμΉ˜λŠ” 영ν–₯이 큰 μƒμœ„ 3개 RDKit Feature\n"
"- 비독성에 λΌμΉ˜λŠ” 영ν–₯이 큰 μƒμœ„ 3개 RDKit Feature\n\n"
"μˆ˜ν–‰ κ³Όμ—…(Tasks):\n"
"SMILES ꡬ쑰 뢄석\n"
"- 고리(λ°©ν–₯μ‘±/μ§€λ°©μ‘±), ν—€ν…Œλ‘œμ›μž, μ „ν•˜ 쀑심, λ°˜μ‘μ„± λͺ¨ν‹°ν”„, H-κ²°ν•© 곡여/수용기 등을\n"
" SMILESμ—μ„œ 직접 κ΄€μ°° κ°€λŠ₯ν•œ λ²”μœ„λ‘œλ§Œ 기술.\n\n"
"Cell Type, Cell Line, Assay Name νŠΉμ§• 뢄석 및 SMILES와 μ—°κ²°\n\n"
"RDKit feature 뢄석\n"
"- 각 featureκ°€ μ˜λ―Έν•˜λŠ” 바와 일반적 독성 λ¦¬μŠ€ν¬μ— μ£ΌλŠ” 영ν–₯ μš”μ•½.\n"
"- κ°€λŠ₯ν•œ 경우 Assay λ§₯락(예: ARE μ‚°ν™”μŠ€νŠΈλ ˆμŠ€)κ³Ό μ—°κ²°.\n\n"
"μ’…ν•© νŒλ‹¨(μ΅œμ’… κ²°λ‘ )\n"
"- (1) SMILES λͺ¨ν‹°ν”„, (2) Cell line/Cell type + Assay λ§₯락, (3) RDKit featureλ₯Ό 톡합해\n"
" 독성 μ—¬λΆ€λ₯Ό μ΄μ§„μœΌλ‘œ νŒλ‹¨.\n\n"
"좜λ ₯ κ·œμΉ™:\n"
"- 본문은 μ˜μ–΄λ‘œ μž‘μ„±.\n"
"- λ§ˆμ§€λ§‰ 쀄에 μ•„λž˜ 쀑 ν•˜λ‚˜λ§Œ 단독 ν‘œκΈ°:\n"
"<answer>toxic</answer>\n"
"<answer>nontoxic</answer>\n\n"
)
# 3. Example input data
smiles_code = "O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.CN(C)CCN(Cc1cccs1)c2ccccn2.CN(C)CCN(Cc1cccs1)c2ccccn2"
cell_line = "HepG2 (Liver)"
feature_NL = "Top toxic features: ... / Top non-toxic features: ..."
feature_descript = "Detailed feature descriptions"
bio_assay = "AhR"
# CoT instruction, in Korean as used during training
# ("Determine whether the compound <SMILES> is toxic or non-toxic.")
instruction = "ν™”ν•©λ¬Ό O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.CN(C)CCN(Cc1cccs1)c2ccccn2.CN(C)CCN(Cc1cccs1)c2ccccn2의 독성/비독성 μ—¬λΆ€λ₯Ό νŒλ‹¨ν•˜μ‹œμ˜€."
# 4. Build the prompt
user_content = (
f"SMILES: {smiles_code}\n"
f"Cell Line: {cell_line}\n"
f"Bio Assay Name: {bio_assay}\n"
f"Feature NL: {feature_NL}\n"
f"Feature Descript: {feature_descript}\n\n"
f"{instruction}"
)
# 5. Apply the chat template and generate
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_content}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=8192,
temperature=0.7,
top_p=0.8,
do_sample=True
)
# 6. Decode the result
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
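
Per the output rules in the system prompt, the response ends with a single `<answer>toxic</answer>` or `<answer>nontoxic</answer>` tag, so the final verdict can be extracted with a small regex. `parse_verdict` is an illustrative helper, not part of the released code:

```python
import re

def parse_verdict(response: str):
    """Return 'toxic' or 'nontoxic' from the model's <answer> tag,
    or None if no well-formed tag is present."""
    matches = re.findall(r"<answer>(toxic|nontoxic)</answer>", response)
    # If the model emits more than one tag, use the last (final conclusion).
    return matches[-1] if matches else None

example = (
    "The fused aromatic system and high logP are consistent with AhR activation...\n"
    "<answer>toxic</answer>"
)
print(parse_verdict(example))  # -> toxic
```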
## Acknowledgements
This model is part of the results of the "2025 Hyperscale AI Ecosystem Expansion Project", supported by the Ministry of Science and ICT (MSIT) and the National Information Society Agency (NIA) of Korea.