SFT dataset details
In Figure 2 (Main Pipeline of Model Training), ChemData is described as 7M Q&A pairs from chemical corpora. However, in Appendix C (hyperparameters), ChemData is listed with 70.2 million entries. Could you clarify which figure is correct?
Thanks for your comment!
ChemData and ChemLLM are still works in progress.
There may be some typos in our preprint.
The correct dataset size is the one shown in Table 4: Statistics of Instruction Datasets.
This number may change in the future depending on our data processing strategy settings.
Thanks for your help!
Also, I want to ask why you need to test on GSM8K and use the GSM8K training dataset. They won't influence performance on chemistry tasks, will they?
Our data includes tasks for property prediction and arithmetic computation on molecular structures, so we wanted to explore how chemical computing tasks influence the model's mathematical ability.
We use GSM8K only as an evaluation benchmark; it is not included in our training dataset.
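For readers who want to check this kind of claim on their own data, here is a minimal sketch of an n-gram overlap test between a training corpus and a benchmark. It is not the ChemLLM authors' procedure; the function names, the 8-token window, and the example strings are all assumptions made for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and keep only alphanumeric tokens, so trivial
    formatting differences do not hide a match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in the normalized text.
    Texts shorter than n tokens yield an empty set."""
    tokens = normalize(text).split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_overlap(train_texts, benchmark_texts, n: int = 8) -> list:
    """Return indices of benchmark items that share at least one
    n-gram with any training text (hypothetical helper)."""
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    return [
        i for i, item in enumerate(benchmark_texts)
        if ngrams(item, n) & train_grams
    ]


# Illustrative placeholder strings, not real dataset entries.
train = ["What is the molar mass of water, given H is 1 and O is 16?"]
bench = [
    "A farmer has twelve sheep and buys nine more, how many sheep are there now?",
    "what is the molar mass of water given H is 1 and O is 16",
]
flagged = find_overlap(train, bench, n=8)
```

An empty result only shows that no long verbatim span is shared; paraphrased or translated leakage would need a fuzzier comparison.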
I find it very interesting that the 7-million-entry chemical SFT dataset improved the already high GSM8K score by 6 points. I hope the reasons for this can be analyzed later.
Some training data was collected or crawled from Hugging Face and other internet sources, with limited analysis.
We're still working on improving and cleaning this data to produce more insights into LLM applications in chemistry.
Because of this, our training data volume, analytics, and evaluation results will all be affected by future changes in our model training or data processing strategies.
Thanks!
I'm very interested in your work. Thanks!
Thanks for your insightful comments!