Boyue27 committed
Commit 9e570e6 · verified · 1 Parent(s): 4337a51

Create README.md

Files changed (1):
  README.md: ADDED (+118 lines, -0)
---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Overview

SNOWTEAM/medico-mistral is a specialized language model designed for medical applications. It is a transformer-based, decoder-only language model built on Mixtral 8x7B and adapted through full-parameter (global) fine-tuning on a corpus of 4.8 million research papers and 10,000 medical books.

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Base Model:** Mixtral-8x7B-Instruct (mistralai/Mixtral-8x7B-Instruct-v0.1)
- **Model type:** Transformer-based decoder-only language model
- **Language(s) (NLP):** English

## Training Dataset

- **Dataset Size:** 4.8 million research papers and 10,000 medical books.
- **Data Diversity:** Spans a wide range of medical fields, ensuring broad coverage of medical knowledge.
- **Preprocessing:**
  - Books: We collected 10,000 textbooks from various sources such as the open-library, university libraries, and reputable publishers, covering a wide range of medical specialties. For preprocessing, we extracted the text content from the PDF files and then cleaned the data through de-duplication and content filtering, removing extraneous elements such as URLs, author lists, superfluous information, tables of contents, references, and citations (see the sketch after this list).
  - Papers: Academic papers are a valuable knowledge resource because they contain high-quality, cutting-edge medical information. We started from the S2ORC corpus (Lo et al., 2020), which contains 81.1 million English-language academic papers, and kept the biomedical papers identified by a corresponding PubMed Central (PMC) ID. This yielded approximately 4.8 million biomedical papers, totaling over 75 billion tokens (see the sketch after this list).

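The actual cleaning scripts and filtering rules are not released with this card. The snippet below is only a hypothetical sketch of the two steps described above (book-text cleaning and PMC-ID-based paper selection); the function names, regular expressions, and S2ORC field names are illustrative assumptions, not the real pipeline.

```python
# Hypothetical preprocessing sketch; NOT the actual medico-mistral pipeline.
import re

URL_RE = re.compile(r"https?://\S+")

def clean_book_text(text: str) -> str:
    """Rough cleaning of text extracted from a book PDF."""
    text = URL_RE.sub("", text)                      # drop URLs
    paragraphs, seen = [], set()
    for para in text.split("\n\n"):
        para = para.strip()
        if not para or para in seen:                 # de-duplicate repeated blocks
            continue
        # Skip front/back-matter style sections (contents, references, acknowledgements).
        if re.match(r"^(references|bibliography|contents|acknowledg)", para, re.I):
            continue
        seen.add(para)
        paragraphs.append(para)
    return "\n\n".join(paragraphs)

def keep_biomedical_papers(s2orc_records):
    """Keep S2ORC records that carry a PubMed Central ID (used here as a proxy for 'biomedical')."""
    for rec in s2orc_records:                        # rec: dict-like S2ORC metadata row
        if rec.get("pmc_id"):                        # field name is an assumption
            yield rec
```
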
### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** https://huggingface.co/SNOWTEAM/medico-mistral
- **Paper [optional]:**
- **Demo [optional]:**

## How to Get Started with the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "SNOWTEAM/medico-mistral"

# Load the model in half precision and let Accelerate place it across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
)

# The tokenizer is inherited from the base Mixtral-8x7B-Instruct model.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

input_text = ""  # put your prompt here
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

output_ids = model.generate(
    input_ids=input_ids.to(model.device),
    max_new_tokens=300,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens.
output_text = tokenizer.batch_decode(
    output_ids[:, input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(output_text)
```
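
Because the base model is Mixtral-8x7B-Instruct, prompts will often behave better when wrapped in its instruction chat format. Whether this fine-tune still expects that template is not stated in the card; the following sketch assumes it does and reuses the `model` and `tokenizer` from the block above. The example question is purely illustrative.

```python
# Optional: wrap the prompt in the base model's chat template
# (assumes the fine-tune still follows the Mixtral-Instruct format).
messages = [
    {"role": "user", "content": "What are the first-line treatments for type 2 diabetes?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(
    input_ids=input_ids.to(model.device),
    max_new_tokens=300,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0])
```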

## Training Details

#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

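The overview states that the model was adapted with full-parameter (global) fine-tuning on the papers-and-books corpus, but the actual recipe is not published here. The skeleton below is only a minimal sketch of what such a run could look like with the Hugging Face `Trainer`; every hyperparameter, file name, and sequence length is an illustrative assumption, and a model of this size would in practice require FSDP or DeepSpeed across many GPUs.

```python
# Illustrative full-parameter fine-tuning skeleton (NOT the actual training recipe).
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Mixtral's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Hypothetical cleaned corpus of papers and books, one document per line.
raw = load_dataset("text", data_files={"train": "medical_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = raw.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="medico-mistral-ckpt",
    per_device_train_batch_size=1,      # assumed value
    gradient_accumulation_steps=16,     # assumed value
    learning_rate=2e-5,                 # assumed value
    num_train_epochs=1,                 # assumed value
    bf16=True,
    logging_steps=50,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
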
#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]