Blowfish

Introduction

Blowfish is a large language model (LLM) developed for molecular toxicity prediction.

It is fine-tuned from Qwen3-14B and goes beyond plain binary classification: using Chain-of-Thought (CoT) reasoning, it lays out the chemical and biological rationale behind each toxicity verdict.

It jointly analyzes the SMILES string, Cell Line, Bio Assay, and key RDKit Features supplied by the user, then returns a final verdict of toxic or nontoxic.

Key Features

Base Model: Qwen3-14B
Task: Binary Toxicity Prediction and molecular structure analysis
Language: Korean (system instructions), English (chemical reasoning and answers)
Input Data:
- SMILES Code
- Cell Line / Cell Type
- Bio Assay Name
- RDKit Features (the top 3 and bottom 3 Features ranked by SHAP Value)
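The feature selection described in the last item can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper `split_top_features`, the example SHAP values, and the convention that a positive SHAP value pushes the prediction toward "toxic" are all assumptions made for the example. The descriptor names are ordinary RDKit descriptor names chosen for illustration only.

```python
from typing import Dict, List, Tuple


def split_top_features(
    shap_values: Dict[str, float], k: int = 3
) -> Tuple[List[str], List[str]]:
    """Return (toxic_drivers, nontoxic_drivers), each with up to k feature names.

    Assumption: positive SHAP values push toward the toxic class and
    negative values push toward the nontoxic class.
    """
    # Rank features from most positive to most negative SHAP value
    ranked = sorted(shap_values.items(), key=lambda item: item[1], reverse=True)
    # The k most positive contributions (toward toxic)
    toxic = [name for name, value in ranked[:k] if value > 0]
    # The k most negative contributions (toward nontoxic), most negative first
    nontoxic = [name for name, value in ranked[::-1][:k] if value < 0]
    return toxic, nontoxic


# Hypothetical SHAP values for a handful of RDKit descriptors
example_shap = {
    "MolLogP": 0.42,
    "NumAromaticRings": 0.31,
    "TPSA": -0.27,
    "NumHDonors": -0.12,
    "MolWt": 0.08,
    "FractionCSP3": -0.35,
    "NumRotatableBonds": 0.01,
}

toxic_features, nontoxic_features = split_top_features(example_shap)
print(toxic_features)
print(nontoxic_features)
```

How the selected names are then turned into the `Feature NL` and `Feature Descript` fields is not specified here, so the sketch stops at the two lists.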

Prompt Format

To get the best performance from the model, follow the prompt format that was used during training.

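System Prompt

The system prompt is written in Korean and should be passed to the model as-is:

"당신은 분자 독성 예측에 특화된 화학정보학/독성학 전문가입니다. 사용자는 독성/비독성에 영향을 많이 끼치는 Feature 3개씩을 제공합니다... (중략) ... tool call을 사용하지 마세요."

(English gloss: "You are a cheminformatics/toxicology expert specializing in molecular toxicity prediction. The user provides the three Features that most strongly influence toxicity and the three that most strongly influence non-toxicity... (omitted) ... Do not use tool calls.")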

User Input Template

SMILES: {smiles_code}
Cell Line: {cell_line}
Bio Assay Name: {endpoint_category}
Feature NL: {feature_NL_description}
Feature Descript: {feature_detailed_description}

{cot_instruction}

Inference

Requirements

pip install transformers torch accelerate

Usage with transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load the model and tokenizer
model_id = "TeamUNIVA/Blowfish"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 2. Define the system prompt
system_prompt = (
    "당신은 분자 독성 예측에 특화된 화학정보학/독성학 전문가입니다.\n"
    "사용자는 독성/비독성에 영향을 많이 끼치는 Feature 3개씩을 제공합니다.\n\n"
    "입력(사용자가 제공):\n"
    "- SMILES\n- Cell Type\n- Cell Line\n- Bio Assay Name\n"
    "- 독성에 끼치는 영향이 큰 상위 3개 RDKit Feature\n"
    "- 비독성에 끼치는 영향이 큰 상위 3개 RDKit Feature\n\n"
    "수행 과업(Tasks):\n"
    "SMILES 구조 분석\n"
    "- 고리(방향족/지방족), 헤테로원자, 전하 중심, 반응성 모티프, H-결합 공여/수용기 등을\n"
    "  SMILES에서 직접 관찰 가능한 범위로만 기술.\n\n"
    "Cell Type, Cell Line, Assay Name 특징 분석 및 SMILES와 연결\n\n"
    "RDKit feature 분석\n"
    "- 각 feature가 의미하는 바와 일반적 독성 리스크에 주는 영향 요약.\n"
    "- 가능한 경우 Assay 맥락(예: ARE 산화스트레스)과 연결.\n\n"
    "종합 판단(최종 결론)\n"
    "- (1) SMILES 모티프, (2) Cell line/Cell type + Assay 맥락, (3) RDKit feature를 통합해\n"
    "  독성 여부를 이진으로 판단.\n\n"
    "출력 규칙:\n"
    "- 본문은 영어로 작성.\n"
    "- 마지막 줄에 아래 중 하나만 단독 표기:\n"
    "<answer>toxic</answer>\n"
    "<answer>nontoxic</answer>\n\n"
)

# 3. Example input data
smiles_code = "O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.CN(C)CCN(Cc1cccs1)c2ccccn2.CN(C)CCN(Cc1cccs1)c2ccccn2"
cell_line = "HepG2 (Liver)"
feature_NL = "Top toxic features: ... / Top non-toxic features: ..."
feature_descript = "Detailed feature descriptions"
bio_assay = "AhR"
# Korean instruction: "Determine whether compound <SMILES> is toxic or nontoxic."
instruction = f"화합물 {smiles_code}의 독성/비독성 여부를 판단하시오."


# 4. Build the user prompt
user_content = (
    f"SMILES: {smiles_code}\n"
    f"Cell Line: {cell_line}\n"
    f"Bio Assay Name: {bio_assay}\n"
    f"Feature NL: {feature_NL}\n"
    f"Feature Descript: {feature_descript}\n\n"
    f"{instruction}"
)

# 5. Apply the chat template and generate
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_content}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    temperature=0.7,
    top_p=0.8,
    do_sample=True
)

# 6. Decode the result (only the newly generated tokens)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
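Because the system prompt asks the model to end with either `<answer>toxic</answer>` or `<answer>nontoxic</answer>`, the final label can be pulled out of the decoded text with a small parser. The sketch below is one possible approach, not part of the released code: `extract_label` and `ANSWER_PATTERN` are hypothetical names, and the sample string is made-up text rather than real model output. It takes the last matching tag and returns `None` when no tag is found, so callers can decide how to handle malformed generations.

```python
import re
from typing import Optional

# Matches the two answer tags requested by the system prompt
ANSWER_PATTERN = re.compile(
    r"<answer>\s*(toxic|nontoxic)\s*</answer>", re.IGNORECASE
)


def extract_label(response_text: str) -> Optional[str]:
    """Return 'toxic' or 'nontoxic' from the last <answer> tag, or None."""
    matches = ANSWER_PATTERN.findall(response_text)
    return matches[-1].lower() if matches else None


# Made-up response text, used only to exercise the parser
sample_response = (
    "The compound contains a thiophene ring and a tertiary amine ...\n"
    "<answer>nontoxic</answer>"
)

label = extract_label(sample_response)
print(label)
```

In the usage script above, the same function could be applied to `response` after decoding.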