---
license: apache-2.0
datasets:
- FreedomIntelligence/ApolloMoEDataset
language:
- ar
- en
- zh
- ko
- ja
- mn
- th
- vi
- lo
- mg
- de
- pt
- es
- fr
- ru
- it
- hr
- gl
- cs
- co
- la
- uk
- bs
- bg
- eo
- sq
- da
- sa
- gn
- sr
- sk
- gd
- lb
- hi
- ku
- mt
- he
- ln
- bm
- sw
- ig
- rw
- ha
metrics:
- accuracy
base_model:
- Qwen/Qwen2-7B
pipeline_tag: question-answering
tags:
- biology
- medical
---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/Apollo2-7B-GGUF

This is a quantized version of [FreedomIntelligence/Apollo2-7B](https://huggingface.co/FreedomIntelligence/Apollo2-7B), created with llama.cpp.

# Original Model Card

# Democratizing Medical LLMs for Many More Languages

Covering 12 major languages (English, Chinese, French, Hindi, Spanish, Arabic, Russian, Japanese, Korean, German, Italian, and Portuguese) and 38 minor languages so far.

<p align="center">
📃 <a href="https://arxiv.org/abs/2410.10626" target="_blank">Paper</a> • 🌐 <a href="" target="_blank">Demo</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEDataset" target="_blank">ApolloMoEDataset</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEBench" target="_blank">ApolloMoEBench</a> • 🤗 <a href="https://huggingface.co/collections/FreedomIntelligence/apollomoe-and-apollo2-670ddebe3bb1ba1aebabbf2c" target="_blank">Models</a> • 🌐 <a href="https://github.com/FreedomIntelligence/Apollo" target="_blank">Apollo</a> • 🌐 <a href="https://github.com/FreedomIntelligence/ApolloMoE" target="_blank">ApolloMoE</a>
</p>

![Apollo](assets/apollo_medium_final.png)

## 🌈 Update

* **[2024.10.15]** The ApolloMoE repo is published! 🎉

## Languages Coverage

12 major languages and 38 minor languages

<details>
<summary>Click to view the languages coverage</summary>

![ApolloMoE](assets/languages.png)

</details>

## Architecture

<details>
<summary>Click to view the MoE routing image</summary>

![ApolloMoE](assets/hybrid_routing.png)

</details>
## Results

#### Dense

🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-0.5B" target="_blank">Apollo2-0.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-1.5B" target="_blank">Apollo2-1.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-2B" target="_blank">Apollo2-2B</a>

🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-3.8B" target="_blank">Apollo2-3.8B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-7B" target="_blank">Apollo2-7B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-9B" target="_blank">Apollo2-9B</a>

<details>
<summary>Click to view the Dense Models Results</summary>

![ApolloMoE](assets/dense_results.png)

</details>

#### Post-MoE

🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-0.5B" target="_blank">Apollo-MoE-0.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-1.5B" target="_blank">Apollo-MoE-1.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-7B" target="_blank">Apollo-MoE-7B</a>

<details>
<summary>Click to view the Post-MoE Models Results</summary>

![ApolloMoE](assets/post_moe_results.png)

</details>

## Usage Format

##### Apollo2
- 0.5B, 1.5B, 7B: `User:{query}\nAssistant:{response}<|endoftext|>`
- 2B, 9B: `User:{query}\nAssistant:{response}<eos>`
- 3.8B: `<|user|>\n{query}<|end|><|assistant|>\n{response}<|end|>`

##### Apollo-MoE
- 0.5B, 1.5B, 7B: `User:{query}\nAssistant:{response}<|endoftext|>`
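For the models that use the `User:{query}\nAssistant:` template, assembling a single-turn prompt reduces to simple string formatting. A minimal sketch (the helper name is ours, not part of the released code):

```python
def build_prompt(query: str) -> str:
    """Assemble a single-turn prompt in the Apollo2 0.5B/1.5B/7B format."""
    return f"User:{query}\nAssistant:"

# The model is expected to continue after "Assistant:".
print(build_prompt("What are the common symptoms of anemia?"))
```

The `{response}` and end-of-text token in the templates above describe the training format; at inference time you stop the prompt after `Assistant:` and let the model generate the rest.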
## Dataset & Evaluation

- Dataset
  🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEDataset" target="_blank">ApolloMoEDataset</a>

  <details><summary>Click to expand</summary>

  ![ApolloMoE](assets/Dataset.png)

  - [Data category](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus/tree/main/train)

  </details>

- Evaluation
  🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEBench" target="_blank">ApolloMoEBench</a>

  <details><summary>Click to expand</summary>

  - EN:
    - [MedQA-USMLE](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options)
    - [MedMCQA](https://huggingface.co/datasets/medmcqa/viewer/default/test)
    - [PubMedQA](https://huggingface.co/datasets/pubmed_qa): not used in the paper because its results fluctuated too much.
    - [MMLU-Medical](https://huggingface.co/datasets/cais/mmlu)
      - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
  - ZH:
    - [MedQA-MCMLE](https://huggingface.co/datasets/bigbio/med_qa/viewer/med_qa_zh_4options_bigbio_qa/test)
    - [CMB-single](https://huggingface.co/datasets/FreedomIntelligence/CMB): not used in the paper
      - Randomly sampled 2,000 multiple-choice questions with a single answer.
    - [CMMLU-Medical](https://huggingface.co/datasets/haonan-li/cmmlu)
      - Anatomy, Clinical_knowledge, College_medicine, Genetics, Nutrition, Traditional_chinese_medicine, Virology
    - [CExam](https://github.com/williamliujl/CMExam): not used in the paper
      - Randomly sampled 2,000 multiple-choice questions.
  - ES: [Head_qa](https://huggingface.co/datasets/head_qa)
  - FR:
    - [Frenchmedmcqa](https://github.com/qanastek/FrenchMedMCQA)
    - [MMLU_FR]
      - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
  - HI: [MMLU_HI](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Hindi)
    - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
  - AR: [MMLU_AR](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Arabic)
    - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
  - JA: [IgakuQA](https://github.com/jungokasai/IgakuQA)
  - KO: [KorMedMCQA](https://huggingface.co/datasets/sean0042/KorMedMCQA)
  - IT:
    - [MedExpQA](https://huggingface.co/datasets/HiTZ/MedExpQA)
    - [MMLU_IT]
      - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
  - DE: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): German part
  - PT: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): Portuguese part
  - RU: [RuMedBench](https://github.com/sb-ai-lab/MedBench)

  </details>
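All of the benchmarks above are multiple-choice, so the reported accuracy metric reduces to exact match over predicted option letters. A minimal sketch of that computation (our own helper, not part of the Apollo evaluation code):

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Exact-match accuracy over predicted option letters (e.g. 'A'-'D')."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    # Normalize case and whitespace before comparing.
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

print(accuracy(["A", "c", "B", "D"], ["A", "C", "D", "D"]))  # 0.75
```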
## Model Download and Inference

We take Apollo-MoE-0.5B as an example.

1. Log in to Hugging Face

```
huggingface-cli login --token $HUGGINGFACE_TOKEN
```

2. Download the model to a local directory

```python
import os

from huggingface_hub import snapshot_download

local_model_dir = os.path.join('/path/to/models/dir', 'Apollo-MoE-0.5B')
snapshot_download(repo_id="FreedomIntelligence/Apollo-MoE-0.5B", local_dir=local_model_dir)
```

3. Inference example

```python
import os

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

local_model_dir = os.path.join('/path/to/models/dir', 'Apollo-MoE-0.5B')

model = AutoModelForCausalLM.from_pretrained(local_model_dir, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(local_model_dir, trust_remote_code=True)
generation_config = GenerationConfig.from_pretrained(
    local_model_dir,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=1,
    max_new_tokens=7,
    min_new_tokens=2,
    do_sample=False,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
)

inputs = tokenizer(
    'Answer directly.\nThe capital of Mongolia is Ulaanbaatar.\nThe capital of Iceland is Reykjavik.\nThe capital of Australia is',
    return_tensors='pt',
)
inputs = inputs.to(model.device)
pred = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```
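Note that the decoded sequence in the example above echoes the prompt before the generated tokens. To keep only the model's continuation you can strip the prompt prefix; a minimal sketch with plain string handling (assuming the tokenizer round-trips the prompt verbatim, which holds for typical prompts without special tokens):

```python
def extract_continuation(decoded: str, prompt: str) -> str:
    """Drop the echoed prompt from a decoded generation, keeping only new text."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Fall back to the full decoded text if the prompt was not round-tripped exactly.
    return decoded.strip()

# Toy illustration with a hypothetical decoded string:
decoded = "The capital of Australia is Canberra."
prompt = "The capital of Australia is"
print(extract_continuation(decoded, prompt))  # Canberra.
```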

## Results reproduction

<details><summary>Click to expand</summary>

We take Apollo2-7B or Apollo-MoE-0.5B as examples.

1. Download the dataset for the project:

```
bash 0.download_data.sh
```

2. Prepare the test and dev data for a specific model:

- Creates test data with the model's special tokens.

```
bash 1.data_process_test&dev.sh
```

3. Prepare the training data for a specific model (tokenized in advance):

- You can adjust the data training order and the number of training epochs in this step.

```
bash 2.data_process_train.sh
```

4. Train the model:

- For multi-node training, refer to ./src/sft/training_config/zero_multi.yaml.

```
bash 3.single_node_train.sh
```

5. Evaluate your model (generates scores for the benchmark):

```
bash 4.eval.sh
```

</details>

## Citation

Please use the following citation if you intend to use our dataset for training or evaluation:

```
@misc{zheng2024efficientlydemocratizingmedicalllms,
      title={Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts},
      author={Guorui Zheng and Xidong Wang and Juhao Liang and Nuo Chen and Yuping Zheng and Benyou Wang},
      year={2024},
      eprint={2410.10626},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.10626},
}
```