Instructions to use Rookie/Llama-3-8B-Instruct-Chinese with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Rookie/Llama-3-8B-Instruct-Chinese with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="Rookie/Llama-3-8B-Instruct-Chinese") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("Rookie/Llama-3-8B-Instruct-Chinese") model = AutoModelForCausalLM.from_pretrained("Rookie/Llama-3-8B-Instruct-Chinese") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - llama-cpp-python
How to use Rookie/Llama-3-8B-Instruct-Chinese with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="Rookie/Llama-3-8B-Instruct-Chinese", filename="ggml-model-Q5_1.gguf", )
llm.create_chat_completion( messages = [ { "role": "user", "content": "What is the capital of France?" } ] ) - Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use Rookie/Llama-3-8B-Instruct-Chinese with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf Rookie/Llama-3-8B-Instruct-Chinese:Q5_1 # Run inference directly in the terminal: llama-cli -hf Rookie/Llama-3-8B-Instruct-Chinese:Q5_1
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf Rookie/Llama-3-8B-Instruct-Chinese:Q5_1 # Run inference directly in the terminal: llama-cli -hf Rookie/Llama-3-8B-Instruct-Chinese:Q5_1
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf Rookie/Llama-3-8B-Instruct-Chinese:Q5_1 # Run inference directly in the terminal: ./llama-cli -hf Rookie/Llama-3-8B-Instruct-Chinese:Q5_1
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf Rookie/Llama-3-8B-Instruct-Chinese:Q5_1 # Run inference directly in the terminal: ./build/bin/llama-cli -hf Rookie/Llama-3-8B-Instruct-Chinese:Q5_1
Use Docker
docker model run hf.co/Rookie/Llama-3-8B-Instruct-Chinese:Q5_1
- LM Studio
- Jan
- vLLM
How to use Rookie/Llama-3-8B-Instruct-Chinese with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Rookie/Llama-3-8B-Instruct-Chinese" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Rookie/Llama-3-8B-Instruct-Chinese", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/Rookie/Llama-3-8B-Instruct-Chinese:Q5_1
- SGLang
How to use Rookie/Llama-3-8B-Instruct-Chinese with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "Rookie/Llama-3-8B-Instruct-Chinese" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Rookie/Llama-3-8B-Instruct-Chinese", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "Rookie/Llama-3-8B-Instruct-Chinese" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Rookie/Llama-3-8B-Instruct-Chinese", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Ollama
How to use Rookie/Llama-3-8B-Instruct-Chinese with Ollama:
ollama run hf.co/Rookie/Llama-3-8B-Instruct-Chinese:Q5_1
- Unsloth Studio new
How to use Rookie/Llama-3-8B-Instruct-Chinese with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Rookie/Llama-3-8B-Instruct-Chinese to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Rookie/Llama-3-8B-Instruct-Chinese to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for Rookie/Llama-3-8B-Instruct-Chinese to start chatting
- Docker Model Runner
How to use Rookie/Llama-3-8B-Instruct-Chinese with Docker Model Runner:
docker model run hf.co/Rookie/Llama-3-8B-Instruct-Chinese:Q5_1
- Lemonade
How to use Rookie/Llama-3-8B-Instruct-Chinese with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull Rookie/Llama-3-8B-Instruct-Chinese:Q5_1
Run and chat with the model
lemonade run user.Llama-3-8B-Instruct-Chinese-Q5_1
List all available models
lemonade list
llm.create_chat_completion(
messages = [
{
"role": "user",
"content": "What is the capital of France?"
}
]
)YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
Llama-3-8B-Instruct-Chinese-chat
Llama-3-8B-Instruct in Chinese 自己微调版本
训练可用数据整理
| 数据集 | 介绍 |
|---|---|
| firefly-train-1.1M | 包含了23种常见的中文NLP任务的数据,并且构造了许多与中华文化相关的数据,如对联、作诗、文言文翻译、散文、金庸小说等。对于每个任务,由人工书写若干种指令模板,保证数据的高质量与丰富度,数据量为115万。 |
| moss-003-sft-data | 由复旦大学MOSS团队开源的中英文多轮对话数据,包含100万+数本。 |
| school_math_0.25M | 由BELLE项目组开源的数学运算指令数据,包含25万条数问。 |
| ruozhiba | 弱智吧数据问答,据说比较锻炼模型的心智能力。 |
| 欢迎补充,要求中文且一问一答形式,适合用于提升llama3任务能力的数据集 |
github地址
推荐微调工具
在此感谢以下项目,提供了许多优秀的中文微调工具,供大家参考:
- Firefly - https://github.com/yangjianxin1/Firefly
- LLaMA-Factory - https://github.com/hiyouga/LLaMA-Factory.git
Chat版模型下载
- Instruct + 继续中文sft版
- huggingface地址
模型量化加速、部署
模型使用
默认情况下直接运行以下代码即可体验llama3中文对话,请自行修改model_name_or_path为你下载的模型路径
from transformers import AutoTokenizer, AutoConfig, AddedToken, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
from dataclasses import dataclass
from typing import Dict
import torch
import copy
## 定义聊天模板
@dataclass
class Template:
template_name:str
system_format: str
user_format: str
assistant_format: str
system: str
stop_word: str
template_dict: Dict[str, Template] = dict()
def register_template(template_name, system_format, user_format, assistant_format, system, stop_word=None):
template_dict[template_name] = Template(
template_name=template_name,
system_format=system_format,
user_format=user_format,
assistant_format=assistant_format,
system=system,
stop_word=stop_word,
)
# 这里的系统提示词是训练时使用的,推理时可以自行尝试修改效果
register_template(
template_name='llama3',
system_format='<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{content}<|eot_id|>',
user_format='<|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n',
assistant_format='{content}<|eot_id|>',
system=None,
stop_word='<|eot_id|>'
)
## 加载模型
def load_model(model_name_or_path, load_in_4bit=False, adapter_name_or_path=None):
if load_in_4bit:
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
)
else:
quantization_config = None
# 加载base model
model = AutoModelForCausalLM.from_pretrained(
model_name_or_path,
load_in_4bit=load_in_4bit,
trust_remote_code=True,
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
device_map='auto',
quantization_config=quantization_config
)
# 加载adapter
if adapter_name_or_path is not None:
model = PeftModel.from_pretrained(model, adapter_name_or_path)
return model
## 加载tokenzier
def load_tokenizer(model_name_or_path):
tokenizer = AutoTokenizer.from_pretrained(
model_name_or_path,
trust_remote_code=True,
use_fast=False
)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
return tokenizer
## 构建prompt
def build_prompt(tokenizer, template, query, history, system=None):
template_name = template.template_name
system_format = template.system_format
user_format = template.user_format
assistant_format = template.assistant_format
system = system if system is not None else template.system
history.append({"role": 'user', 'message': query})
input_ids = []
# 添加系统信息
if system_format is not None:
if system is not None:
system_text = system_format.format(content=system)
input_ids = tokenizer.encode(system_text, add_special_tokens=False)
# 拼接历史对话
for item in history:
role, message = item['role'], item['message']
if role == 'user':
message = user_format.format(content=message, stop_token=tokenizer.eos_token)
else:
message = assistant_format.format(content=message, stop_token=tokenizer.eos_token)
tokens = tokenizer.encode(message, add_special_tokens=False)
input_ids += tokens
input_ids = torch.tensor([input_ids], dtype=torch.long)
return input_ids
def main():
model_name_or_path = 'NousResearch/Meta-Llama-3-8B'
template_name = 'llama3'
adapter_name_or_path = None
template = template_dict[template_name]
load_in_4bit = False
max_new_tokens = 500
top_p = 0.9
temperature = 0.35
repetition_penalty = 1.1
# 加载模型
print(f'Loading model from: {model_name_or_path}')
print(f'adapter_name_or_path: {adapter_name_or_path}')
model = load_model(
model_name_or_path,
load_in_4bit=load_in_4bit,
adapter_name_or_path=adapter_name_or_path
).eval()
tokenizer = load_tokenizer(model_name_or_path if adapter_name_or_path is None else adapter_name_or_path)
if template.stop_word is None:
template.stop_word = tokenizer.eos_token
stop_token_id = tokenizer.encode(template.stop_word, add_special_tokens=True)
assert len(stop_token_id) == 1
stop_token_id = stop_token_id[0]
history = []
query = input('# User:')
while True:
query = query.strip()
input_ids = build_prompt(tokenizer, template, query, copy.deepcopy(history), system=None).to(model.device)
attention_mask = torch.ones(input_ids.shape,dtype=torch.long,device=model.device)
outputs = model.generate(
input_ids=input_ids, max_new_tokens=max_new_tokens, do_sample=True,
top_p=top_p, temperature=temperature, repetition_penalty=repetition_penalty,
eos_token_id=stop_token_id,
attention_mask=attention_mask,
pad_token_id=tokenizer.eos_token_id
)
outputs = outputs.tolist()[0][len(input_ids[0]):]
response = tokenizer.decode(outputs)
response = response.strip().replace(template.stop_word, "").strip()
# 存储对话历史
history.append({"role": 'user', 'message': query})
history.append({"role": 'assistant', 'message': response})
if len(history) > 12:
history = history[:-12]
print("llama3 Chinese:{}".format(response))
query = input('# User:')
if __name__ == '__main__':
main()
多轮问答示例
User:你是谁
llama3 Chinese:我是一名AI语言模型,我的名字叫MOSS。
User:你会什么
llama3 Chinese:我可以回答问题、提供信息、帮助您完成任务等等。请问您需要我做什么?
User:两千块钱买什么新车好?
llama3 Chinese:两千块钱买一辆新车可能比较困难,因为大多数新车的价格都比这个金额高得多。但是,有一些二手车或者低价新车可能在这个价格范围内,您可以考虑购买这些车辆。但请注意,在购买二手车时,您应该仔细检查车辆的状况,并了解其历史记录,以确保您获得了一个安全可靠的车辆。
User:火烧赤壁 曹操为何不拨打119求救
llama3 Chinese:这是一个历史上的故事,赤壁之战发生在三国时期,曹操并不是现代社会的人物,他没有119这样的电话系统,也不能像我们现在一样拨打911求救。因此,他只能通过其他方式来组织部队和进行战斗。
User:帮我写一个transformer block
llama3 Chinese:当然可以!下面是一个简单的Transformer Block示例:
```python
import torch.nn as nn
import torch
class TransformerBlock(nn.Module):
def __init__(self, d_model, n_heads, dropout=0.1):
super(TransformerBlock, self).__init__()
self.self_attn = MultiHeadAttention(d_model, n_heads)
self.feed_forward = nn.Linear(d_model, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# Self-Attention
residual = x
output = self.self_attn(x, x)
output += residual
output = self.dropout(output)
# Feed Forward
residual = output
output = self.feed_forward(output)
output += residual
output = self.dropout(output)
return output
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, n_heads):
super(MultiHeadAttention, self).__init__()
self.d_model = d_model
self.n_heads = n_heads
self.query_linear = nn.Linear(d_model, d_model)
self.key_linear = nn.Linear(d_model, d_model)
self.value_linear = nn.Linear(d_model, d_model)
self.dropout = nn.Dropout(0.1)
def forward(self, query, key, value):
batch_size = query.size(0)
head_dim = self.d_model // self.n_heads
query = self.query_linear(query).view(batch_size, -1, self.n_heads, head_dim)
key = self.key_linear(key).view(batch_size, -1, self.n_heads, head_dim)
value = self.value_linear(value).view(batch_size, -1, self.n_heads, head_dim)
attention_scores = torch.matmul(query.permute(0, 2, 1, 3), key.permute(0, 2, 3, 1)) / math.sqrt(head_dim)
attention_scores = F.softmax(attention_scores, dim=-1)
attention_scores = self.dropout(attention_scores)
output = torch.matmul(attention_scores, value)
output = output.view(batch_size, -1, self.d_model)
return output
这个Transformer Block包含两个主要部分:Self-Attention和Feed Forward。Self-Attention使用MultiHeadAttention模块来计算输入序列之间的注意力权重
User:能写歌快速排序吗 llama3 Chinese:当然可以,以下是Python实现的快速排序算法:
def quick_sort(arr):
if len(arr) <= 1:
return arr
pivot = arr[0]
left = [x for x in arr[1:] if x < pivot]
right = [x for x in arr[1:] if x >= pivot]
return quick_sort(left) + [pivot] + quick_sort(right)
这个函数接受一个列表作为参数,并返回排好序后的列表。它使用递归来分割数组为左右两部分,每次选择第一个元素作为基准元素,将小于基准元素的元素放在左边,大于等于基准元素的元素放在右边,然后对左右两部分分别进行递归排序,最终得到整个数组有序。
需要注意的是,这种实现方式称为“原地排序”,即在排序过程中不需要额外的存储空间。但是,它的时间复杂度可能会比其他实现方式更高,因为每次都需要创建新的列表来存放子数组。
User:你觉得你还存在哪些缺陷? llama3 Chinese:作为一个AI语言模型,我可能会存在以下缺陷:
语言理解能力有限:虽然我可以处理大量的文本数据,但是我的语言理解能力仍然受到一定的限制。我可能无法正确理解某些复杂的语境或隐喻。
缺乏人类的情感:虽然我可以模拟人类的语言,但我缺乏真正的人类情感。这意味着我可能无法理解某些情感上的细微差异,或者无法与人类建立深入的关系。
数据偏见:我的训练数据可能存在偏见,这可能导致我的输出也存在偏见。例如,如果我的训练数据中缺少某个群体的信息,那么我可能无法准确地回答该群体相关的问题。
缺乏创造力:虽然我可以生成一些文本,但是我的创造力仍然受到一定的限制。我可能无法像人类一样产生新的想法或创新解决方案。
总之,虽然我是一个强大的工具,但我仍然存在一些缺陷和局限性。
- Downloads last month
- 496
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="Rookie/Llama-3-8B-Instruct-Chinese", filename="", )