Quantizations of https://huggingface.co/internlm/internlm2-chat-7b
From original readme
Import from Transformers
To load the InternLM2 7B Chat model using Transformers, use the following code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-chat-7b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
# Hello! How can I help you today?
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
The responses can be streamed using stream_chat:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "internlm/internlm2-chat-7b"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()
length = 0
for response, history in model.stream_chat(tokenizer, "Hello", history=[]):
print(response[length:], flush=True, end="")
length = len(response)
- Downloads last month
- 71
Hardware compatibility
Log In to add your hardware
1-bit
2-bit
3-bit
4-bit
5-bit
6-bit
8-bit