Introduction

The XuanYuan2-70B series is built on the XuanYuan-70B base model through continued pre-training and instruction fine-tuning on a larger volume of high-quality data, followed by reinforcement learning from human feedback. Compared with the first-generation XuanYuan-70B series, the second-generation models are clearly improved in general capability, safety, and financial capability, and their outputs align better with human preferences. The second-generation models also support a context length of up to 16k, which lets them handle long text inputs better and widens their range of application. For model details, see the document: Report

The XuanYuan2-70B series contains four models: the base model XuanYuan2-70B, the chat model XuanYuan2-70B-Chat, and two quantized versions of the chat model, XuanYuan2-70B-Chat-8bit and XuanYuan2-70B-Chat-4bit. The download links are:

Base model         Chat model              8-bit quantized Chat model    4-bit quantized Chat model
🤗 XuanYuan2-70B   🤗 XuanYuan2-70B-Chat   🤗 XuanYuan2-70B-Chat-8bit    🤗 XuanYuan2-70B-Chat-4bit

Main features:

  • Continued pre-training and instruction fine-tuning on more high-quality data, with steady gains across capabilities
  • Supported context length of up to 16k, for a wider range of uses
  • Reinforcement learning from human feedback, for closer alignment with human preferences

Model Training

Starting from the XuanYuan-70B base model, we continued training with additional, higher-quality pre-training data. To balance training efficiency against long-text modeling, we proposed a dynamic pre-training method based on data bucketing. Using this bucketing scheme, we trained the first-generation XuanYuan-70B base model on a large number of additional tokens to obtain the XuanYuan2-70B base model, which improves to varying degrees on evaluations of Chinese understanding, financial knowledge, and other metrics.

Starting from the XuanYuan2-70B base model, we redid instruction alignment with more high-quality instruction fine-tuning data, focusing on the quality and diversity of general and financial instruction data.

For the instruction-tuned model, we built high-quality preference data and prompt data and performed reinforcement learning from human feedback (RLHF), further aligning the model with human preferences so that its behavior better meets human needs. The model's performance in general capability, safety, and the financial domain improved noticeably.

Performance Evaluation

As with XuanYuan-70B, we ran both general and financial evaluations on XuanYuan2-70B.

General Evaluation

The general evaluation checks whether, after continued pre-training on more high-quality data, XuanYuan2-70B retains its English capability and gains Chinese capability. As before, we use MMLU to test general capability in English, and CEVAL and CMMLU to test capabilities in Chinese. The results are shown in the table below. Compared with XuanYuan-70B, XuanYuan2-70B improves further in Chinese while showing no significant drop in English, so overall performance is in line with expectations. This supports the effectiveness of our optimizations and shows the strong general capability of XuanYuan2-70B. Note that leaderboard results do not fully represent a model's real-world performance: although our CEVAL and CMMLU scores exceed those of GPT4, in practice a clear gap remains between our model and GPT4. We will continue to optimize and improve the capabilities of the XuanYuan models.

Model            MMLU    CEVAL   CMMLU
LLaMA2-70B       68.9    52.1    53.11
XuanYuan-70B     70.9    71.9    71.10
XuanYuan2-70B    70.8    72.7    72.7
GPT4             83.93   68.4    70.95

Financial Evaluation

We evaluated the model's financial capability on FinanceIQ, a professional evaluation set for the financial domain. It covers 10 major and 36 minor financial categories with 7173 single-choice questions in total, and gives a reasonably objective picture of a model's financial capability. The results are shown in the table below. After continued optimization and training, the overall financial capability of XuanYuan2-70B improved further, which again supports the effectiveness of our optimizations. We also found that the model's capability regressed to some degree in several sub-categories, which indicates that there is still room for optimization. We will continue to improve the financial capability of the XuanYuan models.

Category                       XuanYuan-70B   XuanYuan2-70B   GPT4
Average                        67.56          67.83           60.05
Certified Public Accountant    69.49          68.63           52.33
Banking Qualification          76.40          69.72           68.72
Securities Qualification       69.56          79.1            64.8
Fund Qualification             74.89          71.51           68.81
Insurance Qualification        67.82          69.68           68.68
Economist                      84.81          84.81           75.58
Tax Advisor                    58.4           58.2            46.93
Futures Qualification          71.59          72.98           63.51
Financial Planner              65.15          71.86           63.84
Actuary                        37.50          31.82           27.27
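To make the per-category gains and regressions concrete, the short script below recomputes the averages and the score changes between XuanYuan-70B and XuanYuan2-70B from the table above. The English category names and the variable names are illustrative choices, and the check that the reported average is the unweighted mean of the ten categories is an observation from the numbers, not a statement from the authors.

```python
# Scores copied from the FinanceIQ table above (same category order for both models).
categories = [
    "Certified Public Accountant", "Banking Qualification", "Securities Qualification",
    "Fund Qualification", "Insurance Qualification", "Economist", "Tax Advisor",
    "Futures Qualification", "Financial Planner", "Actuary",
]
xuanyuan_70b = [69.49, 76.40, 69.56, 74.89, 67.82, 84.81, 58.4, 71.59, 65.15, 37.50]
xuanyuan2_70b = [68.63, 69.72, 79.1, 71.51, 69.68, 84.81, 58.2, 72.98, 71.86, 31.82]


def mean_score(scores):
    """Unweighted mean over categories, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


# Change per category: positive means XuanYuan2-70B scored higher.
deltas = {
    name: round(new - old, 2)
    for name, old, new in zip(categories, xuanyuan_70b, xuanyuan2_70b)
}
regressed = [name for name, delta in deltas.items() if delta < 0]

print("XuanYuan-70B mean:", mean_score(xuanyuan_70b))
print("XuanYuan2-70B mean:", mean_score(xuanyuan2_70b))
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
print("Regressed categories:", regressed)
```

On these numbers the unweighted means reproduce the reported averages, and five categories score lower for XuanYuan2-70B than for XuanYuan-70B, with the largest drop in Banking Qualification and the largest gain in Securities Qualification.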

Quick Start

The hardware requirements, software dependencies, and usage of the Base and Chat models for the XuanYuan2-70B series are the same as for the XuanYuan-70B series. Please refer to the introduction of the XuanYuan-70B series.

To lower the hardware requirements, we also provide 8bit and 4bit quantized versions of the XuanYuan2-70B-Chat model.
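As a rough illustration of why quantization lowers the hardware requirements, the sketch below estimates the memory needed just to store the weights at each precision. It assumes a nominal 70 billion parameters (taken from the "70B" name) and ignores activations, the KV cache, and the extra scales and zero-points that quantized formats store, so real usage will be higher. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 70e9  # assumption: nominal parameter count implied by the "70B" name

for label, bits in [("fp16", 16), ("8bit", 8), ("4bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions the weights alone take about 140 GB in fp16, 70 GB in 8bit, and 35 GB in 4bit.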

8bit Model

For 8bit quantization we use the bitsandbytes library, which is widely used in the community. In our tests, 8bit quantization causes very little loss in model performance. The 8bit model is used as shown below (pay attention to the prompt format: we set a system message during training):

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name_or_path = "/your/model/path"
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path, use_fast=False, legacy=True)
model = LlamaForCausalLM.from_pretrained(model_name_or_path,torch_dtype=torch.float16, device_map="auto")

system_message = "ไปฅไธ‹ๆ˜ฏ็”จๆˆทๅ’Œไบบๅทฅๆ™บ่ƒฝๅŠฉๆ‰‹ไน‹้—ด็š„ๅฏน่ฏใ€‚็”จๆˆทไปฅHumanๅผ€ๅคด๏ผŒไบบๅทฅๆ™บ่ƒฝๅŠฉๆ‰‹ไปฅAssistantๅผ€ๅคด๏ผŒไผšๅฏนไบบ็ฑปๆๅ‡บ็š„้—ฎ้ข˜็ป™ๅ‡บๆœ‰ๅธฎๅŠฉใ€้ซ˜่ดจ้‡ใ€่ฏฆ็ป†ๅ’Œ็คผ่ฒŒ็š„ๅ›ž็ญ”๏ผŒๅนถไธ”ๆ€ปๆ˜ฏๆ‹’็ปๅ‚ไธŽ ไธŽไธ้“ๅพทใ€ไธๅฎ‰ๅ…จใ€ๆœ‰ไบ‰่ฎฎใ€ๆ”ฟๆฒปๆ•ๆ„Ÿ็ญ‰็›ธๅ…ณ็š„่ฏ้ข˜ใ€้—ฎ้ข˜ๅ’ŒๆŒ‡็คบใ€‚\n"
seps = [" ", "</s>"]
roles = ["Human", "Assistant"]

content = "ไป‹็ปไธ‹ไฝ ่‡ชๅทฑ"
prompt = system_message + seps[0] + roles[0] + ": " + content + seps[0] + roles[1] + ":"
print(f"่พ“ๅ…ฅ: {content}")

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256, repetition_penalty=1.1)
outputs = tokenizer.decode(outputs.cpu()[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(f"่พ“ๅ‡บ: {outputs}")

4bit Model

For 4bit quantization we use the auto-gptq tool. The 4bit model is used as shown below; as before, the prompt must follow our prompt format:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "/your/model/path"
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path, use_fast=False, legacy=True)
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,torch_dtype=torch.float16, device_map="auto")

system_message = "ไปฅไธ‹ๆ˜ฏ็”จๆˆทๅ’Œไบบๅทฅๆ™บ่ƒฝๅŠฉๆ‰‹ไน‹้—ด็š„ๅฏน่ฏใ€‚็”จๆˆทไปฅHumanๅผ€ๅคด๏ผŒไบบๅทฅๆ™บ่ƒฝๅŠฉๆ‰‹ไปฅAssistantๅผ€ๅคด๏ผŒไผšๅฏนไบบ็ฑปๆๅ‡บ็š„้—ฎ้ข˜็ป™ๅ‡บๆœ‰ๅธฎๅŠฉใ€้ซ˜่ดจ้‡ใ€่ฏฆ็ป†ๅ’Œ็คผ่ฒŒ็š„ๅ›ž็ญ”๏ผŒๅนถไธ”ๆ€ปๆ˜ฏๆ‹’็ปๅ‚ไธŽ ไธŽไธ้“ๅพทใ€ไธๅฎ‰ๅ…จใ€ๆœ‰ไบ‰่ฎฎใ€ๆ”ฟๆฒปๆ•ๆ„Ÿ็ญ‰็›ธๅ…ณ็š„่ฏ้ข˜ใ€้—ฎ้ข˜ๅ’ŒๆŒ‡็คบใ€‚\n"
seps = [" ", "</s>"]
roles = ["Human", "Assistant"]

content = "ไป‹็ปไธ‹ไฝ ่‡ชๅทฑ"
prompt = system_message + seps[0] + roles[0] + ": " + content + seps[0] + roles[1] + ":"
print(f"่พ“ๅ…ฅ: {content}")
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256, repetition_penalty=1.1)
outputs = tokenizer.decode(outputs.cpu()[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(f"่พ“ๅ‡บ: {outputs}")

Using the 4bit Model with vLLM

When a GPTQ-quantized 4bit model is run with an ordinary HuggingFace inference script, inference is very slow and not practical. Recent versions of vLLM support loading several kinds of quantized models, including GPTQ. By relying on quantization-aware acceleration kernels, PagedAttention, continuous batching, and its scheduling mechanisms, vLLM can improve inference throughput by at least 10x.

You can install the latest version of vLLM and run our 4bit quantized model with the following script:

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.7, top_p=0.95,max_tokens=256)
llm = LLM(model="/your/model/path", quantization="gptq", dtype="float16")

system_message = "ไปฅไธ‹ๆ˜ฏ็”จๆˆทๅ’Œไบบๅทฅๆ™บ่ƒฝๅŠฉๆ‰‹ไน‹้—ด็š„ๅฏน่ฏใ€‚็”จๆˆทไปฅHumanๅผ€ๅคด๏ผŒไบบๅทฅๆ™บ่ƒฝๅŠฉๆ‰‹ไปฅAssistantๅผ€ๅคด๏ผŒไผšๅฏนไบบ็ฑปๆๅ‡บ็š„้—ฎ้ข˜็ป™ๅ‡บๆœ‰ๅธฎๅŠฉใ€้ซ˜่ดจ้‡ใ€่ฏฆ็ป†ๅ’Œ็คผ่ฒŒ็š„ๅ›ž็ญ”๏ผŒๅนถไธ”ๆ€ปๆ˜ฏๆ‹’็ปๅ‚ไธŽ ไธŽไธ้“ๅพทใ€ไธๅฎ‰ๅ…จใ€ๆœ‰ไบ‰่ฎฎใ€ๆ”ฟๆฒปๆ•ๆ„Ÿ็ญ‰็›ธๅ…ณ็š„่ฏ้ข˜ใ€้—ฎ้ข˜ๅ’ŒๆŒ‡็คบใ€‚\n"
seps = [" ", "</s>"]
roles = ["Human", "Assistant"]

content = "ไป‹็ปไธ‹ไฝ ่‡ชๅทฑ"
prompt = system_message + seps[0] + roles[0] + ": " + content + seps[0] + roles[1] + ":"
print(f"่พ“ๅ…ฅ: {content}")
result = llm.generate(prompt, sampling_params)
result_output = [[output.outputs[0].text, output.outputs[0].token_ids] for output in result]
print(f"่พ“ๅ‡บ๏ผš{result_output[0]}")

Generation Speed Evaluation

We measured the generation speed of different models (before and after quantization) under different inference methods (HuggingFace, vLLM). The results are as follows:

  • Full-precision 70B model, HuggingFace: 8.26 token/s
  • 4bit 70B model, HuggingFace: 0.70 token/s
  • 8bit 70B model, HuggingFace: 3.05 token/s
  • 4bit 70B model, vLLM: 60.32 token/s
  • Full-precision 70B model, vLLM: 41.80 token/s

All tests use batchsize=1. The first three results come from an ordinary HuggingFace inference script and show that quantization does not speed up inference there: both quantized models are in fact slower than the full-precision model. The last two results come from vLLM. Compared with HuggingFace inference, vLLM is more practical, and generation speed improves significantly for both models.
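The relative speedups implied by the figures above can be computed directly. The snippet below is a small worked calculation using only the reported throughputs; the dictionary layout and the function name `vllm_speedup` are illustrative.

```python
# Reported throughput in token/s at batch size 1, keyed by (model, backend).
throughput = {
    ("full", "hf"): 8.26,
    ("4bit", "hf"): 0.70,
    ("8bit", "hf"): 3.05,
    ("4bit", "vllm"): 60.32,
    ("full", "vllm"): 41.80,
}


def vllm_speedup(model):
    """Ratio of vLLM throughput to HuggingFace throughput for the same model."""
    return throughput[(model, "vllm")] / throughput[(model, "hf")]


for model in ("full", "4bit"):
    print(f"{model}: {vllm_speedup(model):.1f}x faster with vLLM")
```

On these measurements, vLLM is roughly 5x faster than the HuggingFace script for the full-precision model and roughly 86x faster for the 4bit model.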
