---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Overview

SNOWTEAM/medico-mistral is a language model specialized for medical applications. It is a transformer-based decoder-only model built on Mixtral 8x7B Instruct and fine-tuned with full-parameter updates on a corpus of 4.8 million research papers and 10,000 medical books.

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Base Model:** Mixtral 8x7B Instruct (mistralai/Mixtral-8x7B-Instruct-v0.1)
- **Model type:** Transformer-based decoder-only language model
- **Language(s) (NLP):** English

## Training Dataset

- **Dataset Size:** 4.8 million research papers and 10,000 medical books.
- **Data Diversity:** Covers a wide range of medical fields, giving broad coverage of medical knowledge.
- **Preprocessing:**
  - Books: We collected 10,000 textbooks from sources such as the Open Library, university libraries, and reputable publishers, spanning a wide range of medical specialties. We extracted the text content from PDF files, then cleaned it through de-duplication and content filtering, removing extraneous elements such as URLs, author lists, tables of contents, references, and citations (a minimal sketch of this cleaning step follows the list).
  - Papers: Academic papers are a valuable knowledge source because they carry high-quality, cutting-edge medical information. We started from the S2ORC dataset (Lo et al., 2020), which contains 81.1 million English-language academic papers, and kept the biomedical papers, identified by the presence of a PubMed Central (PMC) ID. This yielded approximately 4.8 million papers totaling over 75 billion tokens.
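
The exact cleaning pipeline is not published. The sketch below only illustrates the kind of de-duplication and content filtering described above; the regex patterns, hashing scheme, and function names are illustrative assumptions, not the authors' code.

```python
import hashlib
import re

URL_RE = re.compile(r"https?://\S+")
CITATION_RE = re.compile(r"\[\d+(?:,\s*\d+)*\]")  # assumed bracketed-citation style, e.g. [3] or [1, 4]

def clean_passage(text: str) -> str:
    """Stand-in for the content-filtering step: strip URLs and
    bracketed citations, then collapse whitespace."""
    text = URL_RE.sub("", text)
    text = CITATION_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(passages):
    """Exact de-duplication by hashing the normalized text."""
    seen, kept = set(), []
    for passage in passages:
        cleaned = clean_passage(passage)
        digest = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()
        if cleaned and digest not in seen:
            seen.add(digest)
            kept.append(cleaned)
    return kept
```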

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** https://huggingface.co/SNOWTEAM/medico-mistral
- **Paper [optional]:**
- **Demo [optional]:**

## How to Get Started with the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "SNOWTEAM/medico-mistral"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",  # optionally pass max_memory here to cap per-device usage
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

input_text = ""  # your prompt here
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
output_ids = model.generate(
    input_ids=input_ids,
    max_new_tokens=300,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, skipping the prompt.
output_text = tokenizer.batch_decode(
    output_ids[:, input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(output_text)
```
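
Because the tokenizer is taken from the Mixtral 8x7B Instruct checkpoint, prompts formatted with Mixtral's instruction template may behave better than raw text. A minimal sketch using the tokenizer's built-in chat template, assuming the fine-tune preserved that format (this card does not confirm it):

```python
# Assumes the fine-tune kept Mixtral's instruction format; verify on your own prompts.
messages = [{"role": "user", "content": "List common contraindications of metformin."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(
    input_ids=input_ids,
    max_new_tokens=300,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0])
```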

## Training Details

#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]