---
language:
- ko
- en
tags:
- chemistry
- biology
- toxicology
license: apache-2.0
base_model: Qwen/Qwen3-14B
---

# Blowfish

## Introduction

**Blowfish** is a large language model (LLM) developed for **molecular toxicity prediction**.

It is fine-tuned from **Qwen3-14B** and goes beyond simple binary classification: using a **Chain-of-Thought (CoT)** approach, it explains the chemical and biological rationale behind each toxicity judgment step by step.

The model jointly analyzes the user-supplied **SMILES**, **Cell Line**, **Bio Assay**, and **key RDKit Features** to reach a final verdict (`toxic` / `nontoxic`).

### Key Features

* **Base Model:** Qwen3-14B
* **Task:** Binary toxicity prediction and molecular structure analysis
* **Language:** Korean (system instructions), English (chemical reasoning and answers)
* **Input Data:**
    - SMILES Code
    - Cell Line / Cell Type
    - Bio Assay Name
    - RDKit Features (the top 3 and bottom 3 features ranked by SHAP value)
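
One way to pick the three highest and three lowest features by SHAP value is sketched below. This is a hypothetical helper, not part of the released code. It assumes that positive SHAP values push the prediction toward "toxic" and negative values toward "nontoxic". The SHAP numbers are invented for illustration, and the comma-separated output format is a guess, since the exact `Feature NL` wording used in training is not given here.

```python
from typing import Dict, List, Tuple


def select_shap_features(
    shap_values: Dict[str, float], k: int = 3
) -> Tuple[List[str], List[str]]:
    """Return (toxic-leaning, nontoxic-leaning) feature names, k each at most.

    Assumption: SHAP > 0 pushes toward "toxic", SHAP < 0 toward "nontoxic".
    """
    ranked = sorted(shap_values.items(), key=lambda kv: kv[1], reverse=True)
    toxic = [name for name, value in ranked[:k] if value > 0]
    # Walk the tail backwards so the most negative feature comes first.
    nontoxic = [name for name, value in reversed(ranked[-k:]) if value < 0]
    return toxic, nontoxic


def format_feature_nl(toxic: List[str], nontoxic: List[str]) -> str:
    """Join the names into a single line (assumed format, see note above)."""
    return (
        "Top toxic features: " + ", ".join(toxic)
        + " / Top non-toxic features: " + ", ".join(nontoxic)
    )


# Illustrative SHAP values keyed by RDKit descriptor name (made-up numbers).
example_shap = {
    "MolLogP": 0.42,
    "NumAromaticRings": 0.31,
    "TPSA": -0.27,
    "NumHDonors": -0.12,
    "MolWt": 0.05,
    "FractionCSP3": -0.40,
    "NumRotatableBonds": 0.01,
}

toxic, nontoxic = select_shap_features(example_shap)
feature_nl = format_feature_nl(toxic, nontoxic)
print(feature_nl)
```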

---

## Prompt Format

For best performance, follow the prompt format that was used during training.

### System Prompt

The system prompt is written in Korean. In English, its opening and closing lines read:

> "You are a cheminformatics/toxicology expert specializing in molecular toxicity prediction. The user provides the three features that most strongly influence toxicity and the three that most strongly influence non-toxicity... (omitted) ... Do not use tool calls."

### User Input Template

```
SMILES: {smiles_code}
Cell Line: {cell_line}
Bio Assay Name: {endpoint_category}
Feature NL: {feature_NL_description}
Feature Descript: {feature_detailed_description}

{cot_instruction}
```
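
The template above can be filled programmatically. The sketch below is a hypothetical helper (`build_user_prompt` is not part of the model's API) that checks every placeholder is supplied before formatting. The field values in the example are placeholders chosen for illustration.

```python
import string

# User input template, copied from the prompt format above.
USER_TEMPLATE = (
    "SMILES: {smiles_code}\n"
    "Cell Line: {cell_line}\n"
    "Bio Assay Name: {endpoint_category}\n"
    "Feature NL: {feature_NL_description}\n"
    "Feature Descript: {feature_detailed_description}\n\n"
    "{cot_instruction}"
)

# Placeholder names discovered from the template itself.
REQUIRED_FIELDS = [
    name for _, name, _, _ in string.Formatter().parse(USER_TEMPLATE) if name
]


def build_user_prompt(**fields: str) -> str:
    """Fill the template, raising KeyError if any placeholder is missing."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise KeyError(f"missing template fields: {missing}")
    return USER_TEMPLATE.format(**fields)


# Example call with illustrative values.
prompt = build_user_prompt(
    smiles_code="CCO",
    cell_line="HepG2 (Liver)",
    endpoint_category="AhR",
    feature_NL_description="Top toxic features: ... / Top non-toxic features: ...",
    feature_detailed_description="Detailed feature descriptions",
    cot_instruction="Determine whether the compound is toxic or nontoxic.",
)
print(prompt)
```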

---

# Inference

## Requirements

```bash
pip install transformers torch accelerate
```

## Usage with transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load the model and tokenizer
model_id = "TeamUNIVA/Blowfish"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 2. Define the system prompt (kept in Korean, the language of the system instructions)
system_prompt = (
    "당신은 분자 독성 예측에 특화된 화학정보학/독성학 전문가입니다.\n"
    "사용자는 독성/비독성에 영향을 많이 끼치는 Feature 3개씩을 제공합니다.\n\n"
    "입력(사용자가 제공):\n"
    "- SMILES\n- Cell Type\n- Cell Line\n- Bio Assay Name\n"
    "- 독성에 끼치는 영향이 큰 상위 3개 RDKit Feature\n"
    "- 비독성에 끼치는 영향이 큰 상위 3개 RDKit Feature\n\n"
    "수행 과업(Tasks):\n"
    "SMILES 구조 분석\n"
    "- 고리(방향족/지방족), 헤테로원자, 전하 중심, 반응성 모티프, H-결합 공여/수용기 등을\n"
    "  SMILES에서 직접 관찰 가능한 범위로만 기술.\n\n"
    "Cell Type, Cell Line, Assay Name 특징 분석 및 SMILES와 연결\n\n"
    "RDKit feature 분석\n"
    "- 각 feature가 의미하는 바와 일반적 독성 리스크에 주는 영향 요약.\n"
    "- 가능한 경우 Assay 맥락(예: ARE 산화스트레스)과 연결.\n\n"
    "종합 판단(최종 결론)\n"
    "- (1) SMILES 모티프, (2) Cell line/Cell type + Assay 맥락, (3) RDKit feature를 통합해\n"
    "  독성 여부를 이진으로 판단.\n\n"
    "출력 규칙:\n"
    "- 본문은 영어로 작성.\n"
    "- 마지막 줄에 아래 중 하나만 단독 표기:\n"
    "<answer>toxic</answer>\n"
    "<answer>nontoxic</answer>\n\n"
)

# 3. Example input data
smiles_code = "O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.CN(C)CCN(Cc1cccs1)c2ccccn2.CN(C)CCN(Cc1cccs1)c2ccccn2"
cell_line = "HepG2 (Liver)"
feature_NL = "Top toxic features: ... / Top non-toxic features: ..."
feature_descript = "Detailed feature descriptions"
bio_assay = "AhR"
# Korean instruction: "Determine whether compound <SMILES> is toxic or nontoxic."
instruction = f"화합물 {smiles_code}의 독성/비독성 여부를 판단하시오."

# 4. Build the prompt
user_content = (
    f"SMILES: {smiles_code}\n"
    f"Cell Line: {cell_line}\n"
    f"Bio Assay Name: {bio_assay}\n"
    f"Feature NL: {feature_NL}\n"
    f"Feature Descript: {feature_descript}\n\n"
    f"{instruction}"
)

# 5. Apply the chat template and generate
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_content}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    temperature=0.7,
    top_p=0.8,
    do_sample=True
)

# 6. Decode the result (only the newly generated tokens, not the prompt)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
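
The system prompt asks the model to end its reply with either `<answer>toxic</answer>` or `<answer>nontoxic</answer>`, so the binary label can be recovered from the generated text with a small parser. The helper below is a hypothetical post-processing sketch, not part of the released code. It takes the last matching tag and returns `None` when no tag is found, for example if generation was cut off by `max_new_tokens`.

```python
import re
from typing import Optional

# Matches the two answer tags named in the system prompt's output rules.
ANSWER_RE = re.compile(r"<answer>\s*(toxic|nontoxic)\s*</answer>", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return "toxic" or "nontoxic" from the last <answer> tag, else None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].lower() if matches else None


# Example with a made-up response string.
sample = "The thiophene ring may undergo bioactivation...\n<answer>toxic</answer>"
label = extract_answer(sample)
print(label)
```

Because the example above samples with `do_sample=True`, repeated runs on the same input may return different labels. Collecting `extract_answer` results over several generations is one way to gauge how stable a prediction is.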
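
## Acknowledgements

This work is part of the research outcomes of the "2025 Hyperscale AI Diffusion Ecosystem Development Project" (「2025년 초거대AI 확산 생태계 조성사업」), carried out with support from the Ministry of Science and ICT and the National Information Society Agency (NIA) of Korea.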