---
language:
- ko
- en
tags:
- chemistry
- biology
- toxicology
license: apache-2.0
base_model: Qwen/Qwen3-14B
---
# Blowfish
## Introduction
**Blowfish** is a large language model (LLM) developed to perform **molecular toxicity prediction**.
It is fine-tuned from **Qwen3-14B** and goes beyond simple binary classification: using **Chain-of-Thought (CoT)** reasoning, it logically explains the chemical and biological rationale behind each toxicity verdict.
The model jointly analyzes the user-provided **SMILES**, **Cell Line**, **Bio Assay**, and key **RDKit features**, and concludes with a binary verdict (`toxic` / `nontoxic`).
### Key Features
* **Base Model:** Qwen3-14B
* **Task:** Binary toxicity prediction and molecular structure analysis
* **Language:** Korean (system prompt), English (chemical reasoning and answers)
* **Input Data:**
  - SMILES code
  - Cell line / cell type
  - Bio assay name
  - RDKit features (top and bottom 3 features ranked by SHAP value)
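
The SHAP-ranked features are passed to the model as plain text. Below is a minimal sketch of how the feature summary could be assembled from precomputed per-feature SHAP values; the feature names and values are hypothetical, and this helper is not part of the released code:

```python
# Hypothetical per-feature SHAP values for one molecule: positive values push
# the prediction toward "toxic", negative values toward "nontoxic".
shap_values = {
    "MolLogP": 0.42,
    "TPSA": -0.31,
    "NumAromaticRings": 0.27,
    "NumHDonors": -0.18,
    "MolWt": 0.09,
    "NumRotatableBonds": -0.05,
}

# Rank features by SHAP value, then take the top 3 toxicity-driving features
# (largest positive) and the top 3 non-toxicity-driving features (most negative).
ranked = sorted(shap_values.items(), key=lambda kv: kv[1], reverse=True)
top_toxic = [name for name, v in ranked[:3] if v > 0]
top_nontoxic = [name for name, v in reversed(ranked[-3:]) if v < 0]

feature_nl = (
    f"Top toxic features: {', '.join(top_toxic)} / "
    f"Top non-toxic features: {', '.join(top_nontoxic)}"
)
print(feature_nl)
```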
---
## Prompt Format
To get the best performance from the model, follow the prompt format used during training.
### System Prompt
> "You are a cheminformatics/toxicology expert specialized in molecular toxicity prediction. The user provides the 3 features that most strongly drive toxicity and the 3 that most strongly drive non-toxicity... (truncated) ... Do not use tool calls."

Note that the model expects this system prompt in Korean; the full Korean text appears in the inference example below.
### μ‚¬μš©μž μž…λ ₯ ν…œν”Œλ¦Ώ (User Input Template)
```
SMILES: {smiles_code}
Cell Line: {cell_line}
Bio Assay Name: {endpoint_category}
Feature NL: {feature_NL_description}
Feature Descript: {feature_detailed_description}
{cot_instruction}
```
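
The template can be filled with a small helper. `build_user_prompt` and its argument names are illustrative, not part of the released code:

```python
def build_user_prompt(smiles_code, cell_line, endpoint_category,
                      feature_nl, feature_descript, cot_instruction):
    """Fill the user-input template in the exact field order used during training."""
    return (
        f"SMILES: {smiles_code}\n"
        f"Cell Line: {cell_line}\n"
        f"Bio Assay Name: {endpoint_category}\n"
        f"Feature NL: {feature_nl}\n"
        f"Feature Descript: {feature_descript}\n\n"
        f"{cot_instruction}"
    )

# Example with placeholder values.
prompt = build_user_prompt(
    "CCO", "HepG2 (Liver)", "AhR",
    "Top toxic features: ...", "Detailed feature descriptions",
    "Determine whether the compound is toxic or non-toxic.",
)
print(prompt)
```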
---
# Inference
## Requirements
```bash
pip install transformers torch accelerate
```
## Usage with transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# 1. Load the model and tokenizer
model_id = "TeamUNIVA/Blowfish"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# 2. Define the system prompt (kept in Korean, matching the training format)
system_prompt = (
"당신은 λΆ„μž 독성 μ˜ˆμΈ‘μ— νŠΉν™”λœ 화학정보학/독성학 μ „λ¬Έκ°€μž…λ‹ˆλ‹€.\n"
"μ‚¬μš©μžλŠ” 독성/비독성에 영ν–₯을 많이 λΌμΉ˜λŠ” Feature 3κ°œμ”©μ„ μ œκ³΅ν•©λ‹ˆλ‹€.\n\n"
"μž…λ ₯(μ‚¬μš©μžκ°€ 제곡):\n"
"- SMILES\n- Cell Type\n- Cell Line\n- Bio Assay Name\n"
"- 독성에 λΌμΉ˜λŠ” 영ν–₯이 큰 μƒμœ„ 3개 RDKit Feature\n"
"- 비독성에 λΌμΉ˜λŠ” 영ν–₯이 큰 μƒμœ„ 3개 RDKit Feature\n\n"
"μˆ˜ν–‰ κ³Όμ—…(Tasks):\n"
"SMILES ꡬ쑰 뢄석\n"
"- 고리(λ°©ν–₯μ‘±/μ§€λ°©μ‘±), ν—€ν…Œλ‘œμ›μž, μ „ν•˜ 쀑심, λ°˜μ‘μ„± λͺ¨ν‹°ν”„, H-κ²°ν•© 곡여/수용기 등을\n"
" SMILESμ—μ„œ 직접 κ΄€μ°° κ°€λŠ₯ν•œ λ²”μœ„λ‘œλ§Œ 기술.\n\n"
"Cell Type, Cell Line, Assay Name νŠΉμ§• 뢄석 및 SMILES와 μ—°κ²°\n\n"
"RDKit feature 뢄석\n"
"- 각 featureκ°€ μ˜λ―Έν•˜λŠ” 바와 일반적 독성 λ¦¬μŠ€ν¬μ— μ£ΌλŠ” 영ν–₯ μš”μ•½.\n"
"- κ°€λŠ₯ν•œ 경우 Assay λ§₯락(예: ARE μ‚°ν™”μŠ€νŠΈλ ˆμŠ€)κ³Ό μ—°κ²°.\n\n"
"μ’…ν•© νŒλ‹¨(μ΅œμ’… κ²°λ‘ )\n"
"- (1) SMILES λͺ¨ν‹°ν”„, (2) Cell line/Cell type + Assay λ§₯락, (3) RDKit featureλ₯Ό 톡합해\n"
" 독성 μ—¬λΆ€λ₯Ό μ΄μ§„μœΌλ‘œ νŒλ‹¨.\n\n"
"좜λ ₯ κ·œμΉ™:\n"
"- 본문은 μ˜μ–΄λ‘œ μž‘μ„±.\n"
"- λ§ˆμ§€λ§‰ 쀄에 μ•„λž˜ 쀑 ν•˜λ‚˜λ§Œ 단독 ν‘œκΈ°:\n"
"<answer>toxic</answer>\n"
"<answer>nontoxic</answer>\n\n"
)
# 3. Example input data
smiles_code = "O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.CN(C)CCN(Cc1cccs1)c2ccccn2.CN(C)CCN(Cc1cccs1)c2ccccn2"
cell_line = "HepG2 (Liver)"
feature_NL = "Top toxic features: ... / Top non-toxic features: ..."
feature_descript = "Detailed feature descriptions"
bio_assay = "AhR"
# CoT instruction, in Korean as used during training
# ("Determine whether the compound <SMILES> is toxic or non-toxic.")
instruction = "ν™”ν•©λ¬Ό O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.CN(C)CCN(Cc1cccs1)c2ccccn2.CN(C)CCN(Cc1cccs1)c2ccccn2의 독성/비독성 μ—¬λΆ€λ₯Ό νŒλ‹¨ν•˜μ‹œμ˜€."
# 4. Build the prompt
user_content = (
f"SMILES: {smiles_code}\n"
f"Cell Line: {cell_line}\n"
f"Bio Assay Name: {bio_assay}\n"
f"Feature NL: {feature_NL}\n"
f"Feature Descript: {feature_descript}\n\n"
f"{instruction}"
)
# 5. Apply the chat template and generate
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_content}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=8192,
temperature=0.7,
top_p=0.8,
do_sample=True
)
# 6. Decode the result
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
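
Per the output rules in the system prompt, the response ends with a single `<answer>toxic</answer>` or `<answer>nontoxic</answer>` tag, so the final verdict can be extracted with a small regex. `parse_verdict` is an illustrative helper, not part of the released code:

```python
import re

def parse_verdict(response: str):
    """Return 'toxic' or 'nontoxic' from the model's <answer> tag,
    or None if no well-formed tag is present."""
    matches = re.findall(r"<answer>(toxic|nontoxic)</answer>", response)
    # If the model emits more than one tag, use the last (final conclusion).
    return matches[-1] if matches else None

example = (
    "The fused aromatic system and high logP are consistent with AhR activation...\n"
    "<answer>toxic</answer>"
)
print(parse_verdict(example))  # -> toxic
```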
## Acknowledgements
This model is part of the results of the "2025 Hyperscale AI Ecosystem Expansion Project", supported by the Ministry of Science and ICT (MSIT) and the National Information Society Agency (NIA) of Korea.