Blowfish

Introduction

Blowfish is a large language model (LLM) developed for molecular toxicity prediction.

It is fine-tuned from Qwen3-14B and goes beyond simple binary classification: using Chain-of-Thought (CoT) reasoning, it lays out the chemical and biological rationale behind each toxicity call.

It jointly analyzes the user-supplied SMILES, Cell Line, Bio Assay, and key RDKit Features to reach a final verdict (toxic / non-toxic).

Key Features

  • Base Model: Qwen3-14B
  • Task: Binary toxicity prediction and molecular structure analysis
  • Language: Korean (system instructions), English (chemical reasoning and answers)
  • Input Data:
    • SMILES Code
    • Cell Line / Cell Type
    • Bio Assay Name
    • RDKit Features (the top 3 and bottom 3 features ranked by SHAP value)
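
The card does not say how the SHAP-ranked features are selected or serialized, so the following is only a minimal sketch of one way to pick three features per side. It assumes that a positive SHAP value pushes a prediction toward "toxic" and a negative one toward "non-toxic". The helper name `split_top_features`, the SHAP numbers, and the output string layout are all hypothetical.

```python
from typing import Dict, List, Tuple


def split_top_features(
    shap_values: Dict[str, float], k: int = 3
) -> Tuple[List[str], List[str]]:
    """Return (toxic_side, nontoxic_side) feature names, at most k each.

    Assumption: positive SHAP pushes toward "toxic",
    negative SHAP pushes toward "non-toxic".
    """
    # Rank features from most positive to most negative SHAP value.
    ranked = sorted(shap_values.items(), key=lambda item: item[1], reverse=True)
    toxic = [name for name, value in ranked[:k] if value > 0]
    nontoxic = [name for name, value in ranked[::-1][:k] if value < 0]
    return toxic, nontoxic


# Illustrative SHAP values only; they are not taken from the model card.
example_shap = {
    "MolLogP": 0.42,
    "TPSA": -0.31,
    "NumAromaticRings": 0.18,
    "MolWt": 0.07,
    "NumHDonors": -0.12,
    "FractionCSP3": -0.25,
    "NumRotatableBonds": 0.01,
}

toxic_side, nontoxic_side = split_top_features(example_shap)

# Hypothetical layout for the "Feature NL" field of the user prompt.
feature_nl = (
    f"Top toxic features: {', '.join(toxic_side)} / "
    f"Top non-toxic features: {', '.join(nontoxic_side)}"
)
print(feature_nl)
```

The resulting string can be passed as the `Feature NL` field, but the exact wording the model saw during training is not documented here, so treat the layout as a placeholder.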

ํ”„๋กฌํ”„ํŠธ ํ˜•์‹

๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ์ตœ์ ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต ์‹œ ์‚ฌ์šฉ๋œ ํ”„๋กฌํ”„ํŠธ ํ˜•์‹์„ ์ค€์ˆ˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์‹œ์Šคํ…œ ํ”„๋กฌํ”„ํŠธ (System Prompt)

"๋‹น์‹ ์€ ๋ถ„์ž ๋…์„ฑ ์˜ˆ์ธก์— ํŠนํ™”๋œ ํ™”ํ•™์ •๋ณดํ•™/๋…์„ฑํ•™ ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค. ์‚ฌ์šฉ์ž๋Š” ๋…์„ฑ/๋น„๋…์„ฑ์— ์˜ํ–ฅ์„ ๋งŽ์ด ๋ผ์น˜๋Š” Feature 3๊ฐœ์”ฉ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค... (์ค‘๋žต) ... tool call์„ ์‚ฌ์šฉํ•˜์ง€ ๋งˆ์„ธ์š”."

์‚ฌ์šฉ์ž ์ž…๋ ฅ ํ…œํ”Œ๋ฆฟ (User Input Template)

SMILES: {smiles_code}
Cell Line: {cell_line}
Bio Assay Name: {endpoint_category}
Feature NL: {feature_NL_description}
Feature Descript: {feature_detailed_description}

{cot_instruction}

Inference

Requirements

pip install transformers torch accelerate

Usage with transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load the model and tokenizer
model_id = "TeamUNIVA/Blowfish"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 2. ์‹œ์Šคํ…œ ํ”„๋กฌํ”„ํŠธ ์ •์˜
system_prompt = (
    "๋‹น์‹ ์€ ๋ถ„์ž ๋…์„ฑ ์˜ˆ์ธก์— ํŠนํ™”๋œ ํ™”ํ•™์ •๋ณดํ•™/๋…์„ฑํ•™ ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค.\n"
    "์‚ฌ์šฉ์ž๋Š” ๋…์„ฑ/๋น„๋…์„ฑ์— ์˜ํ–ฅ์„ ๋งŽ์ด ๋ผ์น˜๋Š” Feature 3๊ฐœ์”ฉ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.\n\n"
    "์ž…๋ ฅ(์‚ฌ์šฉ์ž๊ฐ€ ์ œ๊ณต):\n"
    "- SMILES\n- Cell Type\n- Cell Line\n- Bio Assay Name\n"
    "- ๋…์„ฑ์— ๋ผ์น˜๋Š” ์˜ํ–ฅ์ด ํฐ ์ƒ์œ„ 3๊ฐœ RDKit Feature\n"
    "- ๋น„๋…์„ฑ์— ๋ผ์น˜๋Š” ์˜ํ–ฅ์ด ํฐ ์ƒ์œ„ 3๊ฐœ RDKit Feature\n\n"
    "์ˆ˜ํ–‰ ๊ณผ์—…(Tasks):\n"
    "SMILES ๊ตฌ์กฐ ๋ถ„์„\n"
    "- ๊ณ ๋ฆฌ(๋ฐฉํ–ฅ์กฑ/์ง€๋ฐฉ์กฑ), ํ—คํ…Œ๋กœ์›์ž, ์ „ํ•˜ ์ค‘์‹ฌ, ๋ฐ˜์‘์„ฑ ๋ชจํ‹ฐํ”„, H-๊ฒฐํ•ฉ ๊ณต์—ฌ/์ˆ˜์šฉ๊ธฐ ๋“ฑ์„\n"
    "  SMILES์—์„œ ์ง์ ‘ ๊ด€์ฐฐ ๊ฐ€๋Šฅํ•œ ๋ฒ”์œ„๋กœ๋งŒ ๊ธฐ์ˆ .\n\n"
    "Cell Type, Cell Line, Assay Name ํŠน์ง• ๋ถ„์„ ๋ฐ SMILES์™€ ์—ฐ๊ฒฐ\n\n"
    "RDKit feature ๋ถ„์„\n"
    "- ๊ฐ feature๊ฐ€ ์˜๋ฏธํ•˜๋Š” ๋ฐ”์™€ ์ผ๋ฐ˜์  ๋…์„ฑ ๋ฆฌ์Šคํฌ์— ์ฃผ๋Š” ์˜ํ–ฅ ์š”์•ฝ.\n"
    "- ๊ฐ€๋Šฅํ•œ ๊ฒฝ์šฐ Assay ๋งฅ๋ฝ(์˜ˆ: ARE ์‚ฐํ™”์ŠคํŠธ๋ ˆ์Šค)๊ณผ ์—ฐ๊ฒฐ.\n\n"
    "์ข…ํ•ฉ ํŒ๋‹จ(์ตœ์ข… ๊ฒฐ๋ก )\n"
    "- (1) SMILES ๋ชจํ‹ฐํ”„, (2) Cell line/Cell type + Assay ๋งฅ๋ฝ, (3) RDKit feature๋ฅผ ํ†ตํ•ฉํ•ด\n"
    "  ๋…์„ฑ ์—ฌ๋ถ€๋ฅผ ์ด์ง„์œผ๋กœ ํŒ๋‹จ.\n\n"
    "์ถœ๋ ฅ ๊ทœ์น™:\n"
    "- ๋ณธ๋ฌธ์€ ์˜์–ด๋กœ ์ž‘์„ฑ.\n"
    "- ๋งˆ์ง€๋ง‰ ์ค„์— ์•„๋ž˜ ์ค‘ ํ•˜๋‚˜๋งŒ ๋‹จ๋… ํ‘œ๊ธฐ:\n"
    "<answer>toxic</answer>\n"
    "<answer>nontoxic</answer>\n\n"
)

# 3. Example input data
smiles_code = "O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.CN(C)CCN(Cc1cccs1)c2ccccn2.CN(C)CCN(Cc1cccs1)c2ccccn2"
cell_line = "HepG2 (Liver)"
# No trailing comma here: a trailing comma would turn this string into a tuple.
feature_NL = "Top toxic features: ... / Top non-toxic features: ..."
feature_descript = "Detailed feature descriptions"
bio_assay = "AhR"
# Korean instruction: "Determine whether compound <SMILES> is toxic or non-toxic."
instruction = f"화합물 {smiles_code}의 독성/비독성 여부를 판단하시오."


# 4. Build the user prompt
user_content = (
    f"SMILES: {smiles_code}\n"
    f"Cell Line: {cell_line}\n"
    f"Bio Assay Name: {bio_assay}\n"
    f"Feature NL: {feature_NL}\n"
    f"Feature Descript: {feature_descript}\n\n"
    f"{instruction}"
)

# 5. Apply the chat template and generate
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_content}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    temperature=0.7,
    top_p=0.8,
    do_sample=True
)

# 6. Decode only the newly generated tokens (the prompt tokens are sliced off)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
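
The system prompt asks the model to end with exactly one `<answer>` tag, so the binary label can be pulled out of the decoded text. The following is a minimal sketch of that step. The helper name `extract_answer` is hypothetical, and taking the last tag when several appear is an assumption, not something the card specifies.

```python
import re
from typing import Optional

# Matches the two labels named in the system prompt's output rules.
ANSWER_PATTERN = re.compile(
    r"<answer>\s*(toxic|nontoxic)\s*</answer>", re.IGNORECASE
)


def extract_answer(response: str) -> Optional[str]:
    """Return "toxic" or "nontoxic" from the last <answer> tag, else None."""
    matches = ANSWER_PATTERN.findall(response)
    # Assumption: if the reasoning text mentions a tag earlier,
    # the final one is the model's verdict.
    return matches[-1].lower() if matches else None


# Stand-in text for illustration; this is not real model output.
sample_response = "Reasoning about the structure...\n<answer>nontoxic</answer>"
label = extract_answer(sample_response)
print(label)
```

A return value of `None` signals that the generation was cut off or did not follow the output rules, which is worth handling explicitly rather than defaulting to either label.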

Acknowledgements

๋ณธ ๊ฒฐ๊ณผ๋ฌผ์€ ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€์™€ ํ•œ๊ตญ์ง€๋Šฅ์ •๋ณด์‚ฌํšŒ์ง„ํฅ์›์˜ ์ง€์›์„ ๋ฐ›์•„ ์ˆ˜ํ–‰๋œ ใ€Œ2025๋…„ ์ดˆ๊ฑฐ๋Œ€AI ํ™•์‚ฐ ์ƒํƒœ๊ณ„ ์กฐ์„ฑ์‚ฌ์—…ใ€์˜ ์—ฐ๊ตฌ ์„ฑ๊ณผ์˜ ์ผ๋ถ€์ž…๋‹ˆ๋‹ค.
