UNIVA-Jason committed
Commit 1734218 · verified · 1 Parent(s): ee7dced

Create README.md

Files changed (1)
  1. README.md +153 -0
README.md ADDED
@@ -0,0 +1,153 @@
+ ---
+ language:
+ - ko
+ - en
+ tags:
+ - chemistry
+ - biology
+ - toxicology
+ license: apache-2.0
+ base_model: Qwen/Qwen3-14B
+ ---
+
+ # Blowfish
+
+ ## Introduction
+
+ **Blowfish** is a large language model (LLM) developed for **molecular toxicity prediction**.
+
+ It is fine-tuned from **Qwen3-14B**. Rather than stopping at a binary classification, it uses a **Chain-of-Thought (CoT)** approach to explain the chemical and biological basis of each toxicity judgment step by step.
+
+ The model jointly analyzes the user-supplied **SMILES**, **Cell Line**, **Bio Assay**, and **key RDKit Features**, and then gives a final verdict on toxicity (`toxic` / `nontoxic`).
+
+ ### Key Features
+ * **Base Model:** Qwen3-14B
+ * **Task:** Binary toxicity prediction and molecular structure analysis
+ * **Language:** Korean (system instructions), English (chemical reasoning and answers)
+ * **Input Data:**
+   - SMILES Code
+   - Cell Line / Cell Type
+   - Bio Assay Name
+   - RDKit Features (the top 3 and bottom 3 features ranked by SHAP value)
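
The feature selection described in the last bullet can be sketched as follows. This is an illustrative sketch only, not code from the repository: the helper name `select_shap_features`, the SHAP values, and the convention that a positive SHAP value pushes the prediction toward "toxic" are all assumptions. The descriptor names are ordinary RDKit descriptor names used as placeholders.

```python
def select_shap_features(shap_values, k=3):
    """Return (top_k, bottom_k) feature names ranked by SHAP value.

    Assumption: larger (positive) SHAP values push toward "toxic",
    smaller (negative) values push toward "nontoxic".
    """
    ranked = sorted(shap_values.items(), key=lambda item: item[1], reverse=True)
    top = [name for name, _ in ranked[:k]]
    # Most negative first, so the strongest non-toxic contributor leads.
    bottom = [name for name, _ in ranked[-k:][::-1]]
    return top, bottom


# Hypothetical SHAP values for a handful of RDKit descriptors.
example_shap = {
    "MolLogP": 0.42,
    "NumAromaticRings": 0.31,
    "TPSA": -0.27,
    "NumHDonors": -0.12,
    "MolWt": 0.18,
    "FractionCSP3": -0.35,
    "NumRotatableBonds": 0.05,
}

toxic_features, nontoxic_features = select_shap_features(example_shap)
print("Top toxic features:", toxic_features)
print("Top non-toxic features:", nontoxic_features)
```

With fewer than `2 * k` features the two lists overlap, so a real pipeline would probably want to guard against that case.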
32
+
33
+ ---
34
+
35
+ ## ํ”„๋กฌํ”„ํŠธ ํ˜•์‹
36
+
37
+ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ์ตœ์ ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต ์‹œ ์‚ฌ์šฉ๋œ ํ”„๋กฌํ”„ํŠธ ํ˜•์‹์„ ์ค€์ˆ˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
38
+
39
+ ### ์‹œ์Šคํ…œ ํ”„๋กฌํ”„ํŠธ (System Prompt)
40
+ > "๋‹น์‹ ์€ ๋ถ„์ž ๋…์„ฑ ์˜ˆ์ธก์— ํŠนํ™”๋œ ํ™”ํ•™์ •๋ณดํ•™/๋…์„ฑํ•™ ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค. ์‚ฌ์šฉ์ž๋Š” ๋…์„ฑ/๋น„๋…์„ฑ์— ์˜ํ–ฅ์„ ๋งŽ์ด ๋ผ์น˜๋Š” Feature 3๊ฐœ์”ฉ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค... (์ค‘๋žต) ... tool call์„ ์‚ฌ์šฉํ•˜์ง€ ๋งˆ์„ธ์š”."
41
+
42
+ ### ์‚ฌ์šฉ์ž ์ž…๋ ฅ ํ…œํ”Œ๋ฆฟ (User Input Template)
43
+
44
+ ```
45
+ SMILES: {smiles_code}
46
+ Cell Line: {cell_line}
47
+ Bio Assay Name: {endpoint_category}
48
+ Feature NL: {feature_NL_description}
49
+ Feature Descript: {feature_detailed_description}
50
+
51
+ {cot_instruction}
52
+ ```
+
+ ---
+
+ # Inference
+
+ ## Requirements
+ ```bash
+ pip install transformers torch accelerate
+ ```
+
+ ## Usage with transformers
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # 1. Load the model and tokenizer
+ model_id = "TeamUNIVA/Blowfish"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto"
+ )
+
79
+ # 2. ์‹œ์Šคํ…œ ํ”„๋กฌํ”„ํŠธ ์ •์˜
80
+ system_prompt = (
81
+ "๋‹น์‹ ์€ ๋ถ„์ž ๋…์„ฑ ์˜ˆ์ธก์— ํŠนํ™”๋œ ํ™”ํ•™์ •๋ณดํ•™/๋…์„ฑํ•™ ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค.\n"
82
+ "์‚ฌ์šฉ์ž๋Š” ๋…์„ฑ/๋น„๋…์„ฑ์— ์˜ํ–ฅ์„ ๋งŽ์ด ๋ผ์น˜๋Š” Feature 3๊ฐœ์”ฉ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.\n\n"
83
+ "์ž…๋ ฅ(์‚ฌ์šฉ์ž๊ฐ€ ์ œ๊ณต):\n"
84
+ "- SMILES\n- Cell Type\n- Cell Line\n- Bio Assay Name\n"
85
+ "- ๋…์„ฑ์— ๋ผ์น˜๋Š” ์˜ํ–ฅ์ด ํฐ ์ƒ์œ„ 3๊ฐœ RDKit Feature\n"
86
+ "- ๋น„๋…์„ฑ์— ๋ผ์น˜๋Š” ์˜ํ–ฅ์ด ํฐ ์ƒ์œ„ 3๊ฐœ RDKit Feature\n\n"
87
+ "์ˆ˜ํ–‰ ๊ณผ์—…(Tasks):\n"
88
+ "SMILES ๊ตฌ์กฐ ๋ถ„์„\n"
89
+ "- ๊ณ ๋ฆฌ(๋ฐฉํ–ฅ์กฑ/์ง€๋ฐฉ์กฑ), ํ—คํ…Œ๋กœ์›์ž, ์ „ํ•˜ ์ค‘์‹ฌ, ๋ฐ˜์‘์„ฑ ๋ชจํ‹ฐํ”„, H-๊ฒฐํ•ฉ ๊ณต์—ฌ/์ˆ˜์šฉ๊ธฐ ๋“ฑ์„\n"
90
+ " SMILES์—์„œ ์ง์ ‘ ๊ด€์ฐฐ ๊ฐ€๋Šฅํ•œ ๋ฒ”์œ„๋กœ๋งŒ ๊ธฐ์ˆ .\n\n"
91
+ "Cell Type, Cell Line, Assay Name ํŠน์ง• ๋ถ„์„ ๋ฐ SMILES์™€ ์—ฐ๊ฒฐ\n\n"
92
+ "RDKit feature ๋ถ„์„\n"
93
+ "- ๊ฐ feature๊ฐ€ ์˜๋ฏธํ•˜๋Š” ๋ฐ”์™€ ์ผ๋ฐ˜์  ๋…์„ฑ ๋ฆฌ์Šคํฌ์— ์ฃผ๋Š” ์˜ํ–ฅ ์š”์•ฝ.\n"
94
+ "- ๊ฐ€๋Šฅํ•œ ๊ฒฝ์šฐ Assay ๋งฅ๋ฝ(์˜ˆ: ARE ์‚ฐํ™”์ŠคํŠธ๋ ˆ์Šค)๊ณผ ์—ฐ๊ฒฐ.\n\n"
95
+ "์ข…ํ•ฉ ํŒ๋‹จ(์ตœ์ข… ๊ฒฐ๋ก )\n"
96
+ "- (1) SMILES ๋ชจํ‹ฐํ”„, (2) Cell line/Cell type + Assay ๋งฅ๋ฝ, (3) RDKit feature๋ฅผ ํ†ตํ•ฉํ•ด\n"
97
+ " ๋…์„ฑ ์—ฌ๋ถ€๋ฅผ ์ด์ง„์œผ๋กœ ํŒ๋‹จ.\n\n"
98
+ "์ถœ๋ ฅ ๊ทœ์น™:\n"
99
+ "- ๋ณธ๋ฌธ์€ ์˜์–ด๋กœ ์ž‘์„ฑ.\n"
100
+ "- ๋งˆ์ง€๋ง‰ ์ค„์— ์•„๋ž˜ ์ค‘ ํ•˜๋‚˜๋งŒ ๋‹จ๋… ํ‘œ๊ธฐ:\n"
101
+ "<answer>toxic</answer>\n"
102
+ "<answer>nontoxic</answer>\n\n"
103
+ )
104
+
+ # 3. Example input data
+ smiles_code = "O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.O=C(O)/C=C/C(=O)O.CN(C)CCN(Cc1cccs1)c2ccccn2.CN(C)CCN(Cc1cccs1)c2ccccn2"
+ cell_line = "HepG2 (Liver)"
+ # No trailing comma here: a trailing comma would turn the string into a tuple.
+ feature_NL = "Top toxic features: ... / Top non-toxic features: ..."
+ feature_descript = "Detailed feature descriptions"
+ bio_assay = "AhR"
+ # Korean instruction: "Determine whether compound <SMILES> is toxic or non-toxic."
+ instruction = f"화합물 {smiles_code}의 독성/비독성 여부를 판단하시오."
+
+
+ # 4. Build the prompt
+ user_content = (
+     f"SMILES: {smiles_code}\n"
+     f"Cell Line: {cell_line}\n"
+     f"Bio Assay Name: {bio_assay}\n"
+     f"Feature NL: {feature_NL}\n"
+     f"Feature Descript: {feature_descript}\n\n"
+     f"{instruction}"
+ )
+
+ # 5. Apply the chat template and generate
+ messages = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": user_content}
+ ]
+
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=8192,
+     temperature=0.7,
+     top_p=0.8,
+     do_sample=True
+ )
+
+ # 6. Decode the result (only the newly generated tokens, without the prompt)
+ response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
+ print(response)
+ ```
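
The system prompt asks the model to end its response with either `<answer>toxic</answer>` or `<answer>nontoxic</answer>`, so the final label can be recovered with a small parser. The sketch below is illustrative and not part of the committed README: the helper name `extract_answer` and the sample response are assumptions, and it takes the last matching tag in case the reasoning text mentions the tags earlier.

```python
import re

# Matches the two answer tags required by the output rules in the system prompt.
ANSWER_PATTERN = re.compile(
    r"<answer>\s*(toxic|nontoxic)\s*</answer>", re.IGNORECASE
)


def extract_answer(response):
    """Return 'toxic' or 'nontoxic' from the last <answer> tag, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].lower() if matches else None


# Hypothetical model output, used only to exercise the parser.
sample_response = (
    "The thiophene ring and tertiary amine are consistent with ...\n"
    "<answer>toxic</answer>"
)
label = extract_answer(sample_response)
print(label)
```

Returning `None` when no tag is found lets the caller decide whether to retry generation or flag the sample for manual review.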
150
+
151
+ ## Acknowledgements
152
+ ๋ณธ ๊ฒฐ๊ณผ๋ฌผ์€ ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€์™€ ํ•œ๊ตญ์ง€๋Šฅ์ •๋ณด์‚ฌํšŒ์ง„ํฅ์›์˜ ์ง€์›์„ ๋ฐ›์•„ ์ˆ˜ํ–‰๋œ
153
+ ใ€Œ2025๋…„ ์ดˆ๊ฑฐ๋Œ€AI ํ™•์‚ฐ ์ƒํƒœ๊ณ„ ์กฐ์„ฑ์‚ฌ์—…ใ€์˜ ์—ฐ๊ตฌ ์„ฑ๊ณผ์˜ ์ผ๋ถ€์ž…๋‹ˆ๋‹ค.