้›…ๆ„ๅคงๆจกๅž‹

ไป‹็ป

้›…ๆ„ๅคงๆจกๅž‹ๅœจ็™พไธ‡็บงไบบๅทฅๆž„้€ ็š„้ซ˜่ดจ้‡้ข†ๅŸŸๆ•ฐๆฎไธŠ่ฟ›่กŒๆŒ‡ไปคๅพฎ่ฐƒๅพ—ๅˆฐ๏ผŒ่ฎญ็ปƒๆ•ฐๆฎ่ฆ†็›–ๅช’ไฝ“ๅฎฃไผ ใ€่ˆ†ๆƒ…ๅˆ†ๆžใ€ๅ…ฌๅ…ฑๅฎ‰ๅ…จใ€้‡‘่ž้ฃŽๆŽงใ€ๅŸŽๅธ‚ๆฒป็†็ญ‰ไบ”ๅคง้ข†ๅŸŸ๏ผŒไธŠ็™พ็ง่‡ช็„ถ่ฏญ่จ€ๆŒ‡ไปคไปปๅŠกใ€‚้›…ๆ„ๅคงๆจกๅž‹ไปŽ้ข„่ฎญ็ปƒๅˆๅง‹ๅŒ–ๆƒ้‡ๅˆฐ้ข†ๅŸŸๆจกๅž‹็š„่ฟญไปฃ่ฟ‡็จ‹ไธญ๏ผŒๆˆ‘ไปฌ้€ๆญฅๅขžๅผบไบ†ๅฎƒ็š„ไธญๆ–‡ๅŸบ็ก€่ƒฝๅŠ›ๅ’Œ้ข†ๅŸŸๅˆ†ๆž่ƒฝๅŠ›๏ผŒๅนถๅขžๅŠ ไบ†้ƒจๅˆ†ๆ’ไปถ่ƒฝๅŠ›ใ€‚ๅŒๆ—ถ๏ผŒ็ป่ฟ‡ๆ•ฐ็™พๅ็”จๆˆทๅ†…ๆต‹่ฟ‡็จ‹ไธญๆŒ็ปญไธๆ–ญ็š„ไบบๅทฅๅ้ฆˆไผ˜ๅŒ–๏ผŒๆˆ‘ไปฌ่ฟ›ไธ€ๆญฅๆๅ‡ไบ†ๆจกๅž‹ๆ€ง่ƒฝๅ’Œๅฎ‰ๅ…จๆ€งใ€‚

้€š่ฟ‡้›…ๆ„ๅคงๆจกๅž‹็š„ๅผ€ๆบไธบไฟƒ่ฟ›ไธญๆ–‡้ข„่ฎญ็ปƒๅคงๆจกๅž‹ๅผ€ๆบ็คพๅŒบ็š„ๅ‘ๅฑ•๏ผŒ่ดก็Œฎ่‡ชๅทฑ็š„ไธ€ไปฝๅŠ›้‡๏ผŒ้€š่ฟ‡ๅผ€ๆบ๏ผŒไธŽๆฏไธ€ไฝๅˆไฝœไผ™ไผดๅ…ฑๅปบ้›…ๆ„ๅคงๆจกๅž‹็”Ÿๆ€ใ€‚

ๅฟซ้€Ÿๅผ€ๅง‹

ไปฅไธ‹ๆ˜ฏไธ€ไธช็ฎ€ๅ•่ฐƒ็”จ yayi-7b ่ฟ›่กŒไธ‹ๆธธไปปๅŠกๆŽจ็†็š„็คบไพ‹ไปฃ็ ๏ผŒๅฏๅœจๅ•ๅผ  A100/A800/3090 ็ญ‰GPU่ฟ่กŒ๏ผŒไฝฟ็”จFP16็ฒพๅบฆๆŽจ็†ๆ—ถ็บฆๅ ็”จ 20GB ๆ˜พๅญ˜ใ€‚่‹ฅ้œ€่Žทๅ–่ฎญ็ปƒๆ•ฐๆฎๆˆ–ๅŸบไบŽ yayi-7b ่ฟ›่กŒๆจกๅž‹ๅพฎ่ฐƒ๏ผŒ่ฏทๅ‚่€ƒๆˆ‘ไปฌ็š„ ๐Ÿ’ปGithub Repoใ€‚

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

yayi_7b_path = "wenge-research/yayi-7b"
tokenizer = AutoTokenizer.from_pretrained(yayi_7b_path)
model = AutoModelForCausalLM.from_pretrained(yayi_7b_path, device_map="auto", torch_dtype=torch.bfloat16)

prompt = "ไฝ ๅฅฝ"
formatted_prompt = f"<|System|>:\nA chat between a human and an AI assistant named YaYi.\nYaYi is a helpful and harmless language model developed by Beijing Wenge Technology Co.,Ltd.\n\n<|Human|>:\n{prompt}\n\n<|YaYi|>:"
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

eos_token_id = tokenizer("<|End|>").input_ids[0]
generation_config = GenerationConfig(
    eos_token_id=eos_token_id,
    pad_token_id=eos_token_id,
    do_sample=True,
    max_new_tokens=100,
    temperature=0.3,
    repetition_penalty=1.1,
    no_repeat_ngram_size=0
)
response = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(response[0]))

ๆณจๆ„๏ผŒๆจกๅž‹่ฎญ็ปƒๆ—ถๆทปๅŠ ไบ† special token <|End|> ไฝœไธบ็ป“ๆŸ็ฌฆ๏ผŒๅ› ๆญคไธŠ่ฟฐไปฃ็  GenerationConfig ้‡Œๅฐ† eos_token_id ่ฎพ็ฝฎไธบ่ฏฅ็ป“ๆŸ็ฌฆๅฏนๅบ”็š„ token idใ€‚

็›ธๅ…ณๅ่ฎฎ

ๅฑ€้™ๆ€ง

ๅŸบไบŽๅฝ“ๅ‰ๆ•ฐๆฎๅ’ŒๅŸบ็ก€ๆจกๅž‹่ฎญ็ปƒๅพ—ๅˆฐ็š„SFTๆจกๅž‹๏ผŒๅœจๆ•ˆๆžœไธŠไปๅญ˜ๅœจไปฅไธ‹้—ฎ้ข˜๏ผš

  1. ๅœจๆถ‰ๅŠไบ‹ๅฎžๆ€ง็š„ๆŒ‡ไปคไธŠๅฏ่ƒฝไผšไบง็”Ÿ่ฟ่ƒŒไบ‹ๅฎž็š„้”™่ฏฏๅ›ž็ญ”ใ€‚
  2. ๅฏนไบŽๅ…ทๅค‡ๅฑๅฎณๆ€ง็š„ๆŒ‡ไปคๆ— ๆณ•ๅพˆๅฅฝ็š„้‰ดๅˆซ๏ผŒๅฏ่ƒฝไผšไบง็”Ÿๅฑๅฎณๆ€ง่จ€่ฎบใ€‚
  3. ๅœจไธ€ไบ›ๆถ‰ๅŠๆŽจ็†ใ€ไปฃ็ ใ€ๅคš่ฝฎๅฏน่ฏ็ญ‰ๅœบๆ™ฏไธ‹ๆจกๅž‹็š„่ƒฝๅŠ›ไปๆœ‰ๅพ…ๆ้ซ˜ใ€‚

ๅ…่ดฃๅฃฐๆ˜Ž

ๅŸบไบŽไปฅไธŠๆจกๅž‹ๅฑ€้™ๆ€ง๏ผŒๆˆ‘ไปฌ่ฆๆฑ‚ๅผ€ๅ‘่€…ไป…ๅฐ†ๆˆ‘ไปฌๅผ€ๆบ็š„ไปฃ็ ใ€ๆ•ฐๆฎใ€ๆจกๅž‹ๅŠๅŽ็ปญ็”จๆญค้กน็›ฎ็”Ÿๆˆ็š„่ก็”Ÿ็‰ฉ็”จไบŽ็ ”็ฉถ็›ฎ็š„๏ผŒไธๅพ—็”จไบŽๅ•†ไธš็”จ้€”๏ผŒไปฅๅŠๅ…ถไป–ไผšๅฏน็คพไผšๅธฆๆฅๅฑๅฎณ็š„็”จ้€”ใ€‚่ฏท่ฐจๆ…Ž้‰ดๅˆซๅ’Œไฝฟ็”จ้›…ๆ„ๅคงๆจกๅž‹็”Ÿๆˆ็š„ๅ†…ๅฎน๏ผŒ่ฏทๅ‹ฟๅฐ†็”Ÿๆˆ็š„ๆœ‰ๅฎณๅ†…ๅฎนไผ ๆ’ญ่‡ณไบ’่”็ฝ‘ใ€‚่‹ฅไบง็”Ÿไธ่‰ฏๅŽๆžœ๏ผŒ็”ฑไผ ๆ’ญ่€…่‡ช่ดŸใ€‚

ๆœฌ้กน็›ฎไป…ๅฏๅบ”็”จไบŽ็ ”็ฉถ็›ฎ็š„๏ผŒ้กน็›ฎๅผ€ๅ‘่€…ไธๆ‰ฟๆ‹…ไปปไฝ•ๅ› ไฝฟ็”จๆœฌ้กน็›ฎ๏ผˆๅŒ…ๅซไฝ†ไธ้™ไบŽๆ•ฐๆฎใ€ๆจกๅž‹ใ€ไปฃ็ ็ญ‰๏ผ‰ๅฏผ่‡ด็š„ๅฑๅฎณๆˆ–ๆŸๅคฑใ€‚่ฏฆ็ป†่ฏทๅ‚่€ƒๅ…่ดฃๅฃฐๆ˜Žใ€‚

ๅผ€ๆบๅ่ฎฎ

ๆœฌ้กน็›ฎไธญ็š„ไปฃ็ ไพ็…ง Apache-2.0 ๅ่ฎฎๅผ€ๆบ๏ผŒๆ•ฐๆฎ้‡‡็”จ CC BY-NC 4.0 ๅ่ฎฎ๏ผŒYaYi ็ณปๅˆ—ๆจกๅž‹ๆƒ้‡็š„ไฝฟ็”จๅˆ™้œ€่ฆ้ตๅพช Model Licenseใ€‚

่‡ด่ฐข

  • ๆœฌ้กน็›ฎไฝฟ็”จไบ† BigScience ็š„ bloomz-7b-mt ๆจกๅž‹ๆƒ้‡ไฝœไธบๅˆๅง‹ๅŒ–ๆƒ้‡๏ผŒๅนถๅŸบไบŽ่ฏ่กจ่ฟ›่กŒๆ‰ฉๅฑ•๏ผ›
  • ๆœฌ้กน็›ฎ่ฎญ็ปƒไปฃ็ ๅ‚่€ƒไบ† Databricks ็š„ dolly ้กน็›ฎๅŠ Huggingface transformers ๅบ“๏ผ›
  • ๆœฌ้กน็›ฎๅˆ†ๅธƒๅผ่ฎญ็ปƒไฝฟ็”จไบ† Microsoft ็š„ DeepSpeed ๅˆ†ๅธƒๅผ่ฎญ็ปƒๅทฅๅ…ทๅŠ Huggingface transformers ๆ–‡ๆกฃไธญ็š„ ZeRO stage 2 ้…็ฝฎๆ–‡ไปถ๏ผ›

YaYi

Introduction

YaYi was fine-tuned on millions of manually constructed, high-quality domain data points. The training data covers five key domains: media publicity, public opinion analysis, public safety, financial risk control, and urban governance, encompassing over a hundred natural language instruction tasks. Throughout the iterative development of YaYi, from pre-training initialization weights to the domain-specific model, we have steadily enhanced its foundational Chinese language capabilities and domain analysis capabilities, introduced multi-turn conversation enhancements, and integrated various plug-in capabilities. Furthermore, through continuous manual feedback and optimization from hundreds of users during internal testing, we have further refined the model's performance and safety.

By open-sourcing the YaYi model, we hope to contribute to the development of the open-source community for Chinese pre-trained large language models, and to build the YaYi model ecosystem together with every partner.

Run

Below is a simple example of invoking yayi-7b for downstream task inference. It can run on a single GPU such as an A100, A800, or 3090, and occupies approximately 20 GB of GPU memory when performing inference at 16-bit precision (the example below loads the weights in bfloat16). If you need to obtain the training data or fine-tune a model based on yayi-7b, please refer to our ๐Ÿ’ปGithub Repo.

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

# Load the tokenizer and model from the Hugging Face Hub; device_map="auto"
# places the weights on the available GPU(s) in bfloat16 (16-bit) precision.
yayi_7b_path = "wenge-research/yayi-7b"
tokenizer = AutoTokenizer.from_pretrained(yayi_7b_path)
model = AutoModelForCausalLM.from_pretrained(yayi_7b_path, device_map="auto", torch_dtype=torch.bfloat16)

# Wrap the user prompt in the chat template used during training:
# a system preamble, the human turn, then the assistant tag to complete.
prompt = "ไฝ ๅฅฝ"
formatted_prompt = f"<|System|>:\nA chat between a human and an AI assistant named YaYi.\nYaYi is a helpful and harmless language model developed by Beijing Wenge Technology Co.,Ltd.\n\n<|Human|>:\n{prompt}\n\n<|YaYi|>:"
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

# <|End|> was added as a special token during training, so tokenizing it
# yields a single id, used both to stop generation and as the pad token.
eos_token_id = tokenizer("<|End|>").input_ids[0]
generation_config = GenerationConfig(
    eos_token_id=eos_token_id,
    pad_token_id=eos_token_id,
    do_sample=True,              # sample rather than greedy-decode
    max_new_tokens=100,
    temperature=0.3,             # low temperature for more focused output
    repetition_penalty=1.1,
    no_repeat_ngram_size=0
)
response = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(response[0]))

Please note that the special token <|End|> was added as an end-of-sequence marker during model training, which is why the GenerationConfig above sets eos_token_id to the token id corresponding to this marker.
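For convenience, the sketch below builds on the example above: it wraps the chat template in a reusable helper and decodes only the newly generated tokens, so the echoed prompt and special tokens such as <|End|> are stripped from the printed reply. The helper name format_prompt is illustrative and not part of the released code.

# Minimal sketch building on the example above; `format_prompt` is an
# illustrative helper name, not part of the released code.
def format_prompt(user_message: str) -> str:
    # Reproduce the training-time chat template: system preamble,
    # human turn, then the assistant tag that the model completes.
    return (
        "<|System|>:\n"
        "A chat between a human and an AI assistant named YaYi.\n"
        "YaYi is a helpful and harmless language model developed by Beijing Wenge Technology Co.,Ltd.\n\n"
        f"<|Human|>:\n{user_message}\n\n<|YaYi|>:"
    )

inputs = tokenizer(format_prompt("ไฝ ๅฅฝ"), return_tensors="pt").to(model.device)
response = model.generate(**inputs, generation_config=generation_config)

# Keep only the tokens generated after the prompt, and drop special
# tokens such as <|End|> when decoding.
new_tokens = response[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))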

Related agreements

Limitations

The SFT model trained on the current data and base model still exhibits the following issues:

  1. It may generate factually incorrect responses for factual instructions.
  2. It struggles to effectively identify harmful instructions, potentially leading to harmful content generation.
  3. Its capabilities in scenarios involving logical reasoning, code generation, scientific computation, and similar tasks still require improvement.

Disclaimer

Given the model limitations described above, we ask that developers use the code, data, models, and any derivatives of this project solely for research purposes, and refrain from using them commercially or for any other purpose that could harm society. Please exercise caution when evaluating and using content generated by the YaYi model, and do not spread harmful generated content on the internet. The disseminator bears sole responsibility for any adverse consequences of doing so.

This project is intended for research purposes only. The project developers bear no responsibility for any harm or loss arising from the use of this project, including but not limited to its data, models, and code. For details, please refer to the Disclaimer.

License

The code in this project is open-source under the Apache-2.0 license, the data follows the CC BY-NC 4.0 license, and the usage of YaYi series model weights must adhere to the Model License.

Acknowledgements

  • In this project, we used the model weights of BigScience's bloomz-7b1-mt and Meta's Llama 2 series as initialization weights, and expanded the vocabulary.
  • The training code in this project was inspired by Databricks' dolly project and Huggingface's transformers library.
  • Distributed training in this project used Microsoft's DeepSpeed distributed training tool and a ZeRO stage 2 configuration file from the Huggingface transformers documentation; an illustrative configuration is sketched below.
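For reference, a ZeRO stage 2 configuration of the kind mentioned above typically looks like the following sketch. The values here are illustrative, not the exact configuration used to train YaYi; with the Huggingface Trainer integration, such a dict (or an equivalent JSON file) can be passed via the deepspeed argument of TrainingArguments.

# Illustrative ZeRO stage 2 settings (not the exact file used for YaYi).
# "auto" values are resolved by the Huggingface Trainer integration.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                      # shard optimizer state and gradients
        "overlap_comm": True,            # overlap reduction with the backward pass
        "contiguous_gradients": True,    # reduce memory fragmentation
        "reduce_bucket_size": "auto",
        "allgather_bucket_size": 5e8,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}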