---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Overview

SNOWTEAM/medico-mistral is a specialized language model designed for medical applications. This transformer-based, decoder-only language model is based on the Mixtral 8x7B model and has been fine-tuned through global parameter adjustments on a dataset of 4.8 million research papers and 10,000 medical books.

### Model Description

- **Base Model:** [Mixtral 8x7B model - Instruct](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)
- **Model type:** Transformer-based decoder-only language model
- **Language(s) (NLP):** English

## Training Dataset

- **Dataset Size:** 4.8 million research papers and 10,000 medical books.
- **Data Diversity:** Spans a wide range of medical fields for broad coverage of medical knowledge.
- **Preprocessing:**
  - Books: We collected 10,000 textbooks from sources such as the Open Library, university libraries, and reputable publishers, covering a wide range of medical specialties. For preprocessing, we extracted the text content from the PDF files, then cleaned the data through de-duplication and content filtering. Filtering removed extraneous elements such as URLs, author lists, superfluous information, document contents, references, and citations.
  - Papers: Academic papers are a valuable knowledge resource because they contain high-quality, cutting-edge medical information. We started with the S2ORC (Lo et al. 2020) dataset, which contains 81.1 million English-language academic papers. From this, we selected biomedical papers based on the presence of a corresponding PubMed Central (PMC) ID. This yielded approximately 4.8 million biomedical papers, totaling over 75 billion tokens.
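
The preprocessing steps described above (URL removal, de-duplication, and selecting papers that carry a PMC ID) can be illustrated with a minimal sketch. This is not the pipeline used to build the training set: the function names, the `pmc_id` record key, and the use of exact-match hashing for de-duplication are all assumptions made for illustration, and the actual S2ORC schema and filtering rules may differ.

```python
import hashlib
import re

# Matches http(s) URLs and bare "www." links (a simple heuristic, not exhaustive).
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def clean_text(text: str) -> str:
    """Strip URLs and collapse runs of whitespace into single spaces."""
    text = URL_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates (case-insensitive), keeping the first occurrence."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(doc.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def select_biomedical(papers: list[dict]) -> list[dict]:
    """Keep only records with a non-empty PMC ID.

    The "pmc_id" key is hypothetical; adapt it to the real metadata schema.
    """
    return [paper for paper in papers if paper.get("pmc_id")]


if __name__ == "__main__":
    books = [
        "Anatomy basics. See https://example.com for more.",
        "Anatomy basics. See  for more.",
        "Cardiology notes.",
    ]
    cleaned = deduplicate([clean_text(b) for b in books])
    print(cleaned)

    papers = [{"title": "A", "pmc_id": "PMC123"}, {"title": "B", "pmc_id": None}]
    print(select_biomedical(papers))
```

Exact-match hashing only catches identical documents after cleaning; near-duplicate detection (for example MinHash) would be a separate design choice that this card does not describe.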

### Model Sources [optional]

- **Repository:** https://huggingface.co/SNOWTEAM/medico-mistral
- **Paper [optional]:**
- **Demo [optional]:**

## How to Get Started with the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "SNOWTEAM/medico-mistral"

# Optional per-device memory cap, e.g. {0: "40GiB", "cpu": "64GiB"} (illustrative values).
# None lets the device placement logic use whatever memory is available.
max_memory_mapping = None

# Load in half precision and spread the layers across the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    max_memory=max_memory_mapping,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

input_text = ""  # replace with your prompt
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

output_ids = model.generate(
    input_ids=input_ids.to(model.device),
    max_new_tokens=300,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt.
output_text = tokenizer.batch_decode(
    output_ids[:, input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(output_text)
```

## Training Details

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times [optional]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]