Qwen-7B


🤗 Hugging Face   |   🤖 ModelScope   |   📑 Paper   |   🖥️ Demo
WeChat (微信)   |   Discord   |   API


Introduction

Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. We have now updated both the pretrained and chat models for better performance. This repository is the one for the Qwen-7B base language model.

The features of Qwen-7B include:

  1. Large-scale high-quality training corpora: It is pretrained on over 2.4 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields. The distribution of the pre-training corpus has been optimized through a large number of ablation experiments.
  2. Competitive performance: It significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.), and even surpasses some larger-scale models in several benchmarks. See below for specific evaluation results.
  3. More comprehensive vocabulary coverage: Compared with other open-source models based on Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150K tokens. This vocabulary is more friendly to multiple languages, enabling users to directly further enhance the capability for certain languages without expanding the vocabulary.

For more details about Qwen, please refer to the GitHub code repository.

่ฆๆฑ‚๏ผˆRequirements๏ผ‰

  • Python 3.8 and above
  • PyTorch 1.12 and above; 2.0 and above is recommended
  • CUDA 11.4 and above is recommended (for GPU users, flash-attention users, etc.); a quick environment check is sketched below
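
The snippet below is a minimal sanity check for the requirements above (not part of the original card; it only prints the installed versions so you can compare them against the list):

import sys
import torch

print("python:", sys.version.split()[0])       # needs 3.8+
print("torch:", torch.__version__)             # needs 1.12+; 2.0+ recommended
print("cuda (build):", torch.version.cuda)     # 11.4+ recommended for GPU / flash-attention users
print("cuda available:", torch.cuda.is_available())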

ไพ่ต–้กน (Dependency)

่ฟ่กŒQwen-7B๏ผŒ่ฏท็กฎไฟๆปก่ถณไธŠ่ฟฐ่ฆๆฑ‚๏ผŒๅ†ๆ‰ง่กŒไปฅไธ‹pipๅ‘ฝไปคๅฎ‰่ฃ…ไพ่ต–ๅบ“

To run Qwen-7B, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.

pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed

ๅฆๅค–๏ผŒๆŽจ่ๅฎ‰่ฃ…flash-attentionๅบ“๏ผˆๅฝ“ๅ‰ๅทฒๆ”ฏๆŒflash attention 2๏ผ‰๏ผŒไปฅๅฎž็Žฐๆ›ด้ซ˜็š„ๆ•ˆ็އๅ’Œๆ›ดไฝŽ็š„ๆ˜พๅญ˜ๅ ็”จใ€‚

In addition, it is recommended to install the flash-attention library (flash attention 2 is now supported) for higher efficiency and lower memory usage.

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# The two installs below are optional and can be slow to build.
# pip install csrc/layer_norm
# pip install csrc/rotary

Quickstart

ๆ‚จๅฏไปฅ้€š่ฟ‡ไปฅไธ‹ไปฃ็ ่ฝปๆพ่ฐƒ็”จ๏ผš

You can easily call the model with the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()

# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

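# The prompt below is a Chinese few-shot completion: "The capital of Mongolia is Ulaanbaatar /
# The capital of Iceland is Reykjavik / The capital of Ethiopia is ..."; the model is expected
# to complete it with 亚的斯亚贝巴 (Addis Ababa).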
inputs = tokenizer('蒙古国的首都是乌兰巴托（Ulaanbaatar）\n冰岛的首都是雷克雅未克（Reykjavik）\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# 蒙古国的首都是乌兰巴托（Ulaanbaatar）\n冰岛的首都是雷克雅未克（Reykjavik）\n埃塞俄比亚的首都是亚的斯亚贝巴（Addis Ababa）...
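
To adjust decoding behavior for a single call, the standard transformers generation arguments can be passed to model.generate. The values below are purely illustrative; they are not the defaults shipped in Qwen's generation_config:

# Optional: override generation hyperparameters per call (illustrative values).
pred = model.generate(
    **inputs,
    max_new_tokens=64,   # limit the length of the completion
    do_sample=True,      # sample instead of greedy decoding
    top_p=0.8,
    temperature=0.7,
)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))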

For more information, please refer to our GitHub repo.

Tokenizer

Our tokenizer, based on tiktoken, is different from other tokenizers such as the sentencepiece tokenizer. Pay close attention to special tokens, especially during fine-tuning. For more detailed information on the tokenizer and its use in fine-tuning, please refer to the documentation.
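
As a concrete illustration, the minimal sketch below round-trips a string through the tokenizer using the standard transformers interface (the exact token ids depend on the vocabulary and are not shown here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

ids = tokenizer.encode("Hello, Qwen!")   # tiktoken-based BPE encoding to integer ids
print(ids)
print(tokenizer.decode(ids))             # decodes back to the original text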

Model

The details of the model architecture of Qwen-7B are listed as follows.

| Hyperparameter  | Value  |
| --------------- | ------ |
| n_layers        | 32     |
| n_heads         | 32     |
| d_model         | 4096   |
| vocab size      | 151851 |
| sequence length | 8192   |

For position encoding, FFN activation function, and normalization methods, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU for activation function, and RMSNorm for normalization (optional installation of flash-attention for acceleration).
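
For readers unfamiliar with SwiGLU, the following PyTorch snippet is a minimal sketch of the activation pattern only; the layer names and sizes are illustrative and are not taken from the Qwen-7B implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # FFN(x) = W_down( SiLU(x W_gate) * (x W_up) )
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))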

For tokenization, compared to the current mainstream open-source models based on Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150K tokens. It is built on the cl100k_base BPE vocabulary used by GPT-4, with optimizations for Chinese and multiple languages. It prioritizes efficient encoding of Chinese, English, and code data, and is also friendlier to other languages, enabling users to directly enhance the capability for some languages without expanding the vocabulary. It segments numbers by single digits and uses the efficient tiktoken library for tokenization.

We randomly sampled 1 million documents per language to compare the encoding compression rates of different models (with XLM-R, which supports 100 languages, as the baseline value of 1; lower is better). The comparison is shown in the accompanying figure.
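
The exact definition of the compression metric is not spelled out here; a plausible reading (an assumption on our part, not the official evaluation script) is the total number of tokens a tokenizer produces on the sampled documents, normalized by XLM-R's count:

# Assumed metric: token counts on the same documents, relative to XLM-R (lower is better).
def relative_compression(tokenizer, xlmr_tokenizer, documents):
    n_model = sum(len(tokenizer.encode(doc)) for doc in documents)
    n_xlmr = sum(len(xlmr_tokenizer.encode(doc)) for doc in documents)
    return n_model / n_xlmr  # 1.0 == same as XLM-R; < 1.0 == better compression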

As can be seen, while ensuring efficient decoding of Chinese, English, and code, Qwen-7B also achieves a high compression rate for many widely used languages (such as Thai (th), Hebrew (he), Arabic (ar), Korean (ko), Vietnamese (vi), Japanese (ja), Turkish (tr), Indonesian (id), Polish (pl), Russian (ru), Dutch (nl), Portuguese (pt), Italian (it), German (de), Spanish (es), French (fr), etc.), equipping the model with strong scalability as well as high training and inference efficiency in these languages.

The pretraining corpus exceeds 2.4T tokens after deduplication and filtering, encompassing web text, encyclopedias, books, code, mathematics, and various vertical domains.

Evaluation

We selected MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, and CMMLU, which are currently popular benchmarks, to comprehensively evaluate the model's Chinese and English knowledge, translation, mathematical reasoning, coding, and other capabilities. From the evaluation results below, we can see that the Qwen models outperform open-source models of similar size on all benchmarks.

| Model | MMLU (5-shot) | C-Eval (5-shot) | GSM8K (8-shot) | MATH (4-shot) | HumanEval (0-shot) | MBPP (3-shot) | BBH (3-shot) | CMMLU (5-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | - | 24.4 | 31.2 | 40.6 | 58.8 |
| Qwen-7B | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| Qwen-14B | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |

Long-Context Evaluation

We introduce NTK-aware interpolation, LogN attention scaling, window attention, and other techniques to extend the context length of Qwen-7B (original) and Qwen-14B from 2K to over 8K tokens, and that of the updated Qwen-7B from 8K to 32K. We conduct language modeling experiments on the arXiv dataset with the PPL evaluation. Results are demonstrated below:

(To use NTK interpolation and LogN attention scaling, please set use_dynamic_ntk and use_logn_attn to true in config.json.)
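
If you prefer not to edit config.json by hand, the following minimal sketch sets the same flags programmatically via the standard transformers config mechanism (the attribute names come from the note above; everything else is illustrative):

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
config.use_dynamic_ntk = True   # NTK-aware interpolation
config.use_logn_attn = True     # LogN attention scaling
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", config=config, device_map="auto", trust_remote_code=True
).eval()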

| Model \ Sequence Length | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-7B (original) | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 | - |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 | - |
| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 | - |
| + dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 | - |
| Qwen-7B | 4.23 | 3.81 | 3.52 | 3.31 | 7.27 | 181.49 |
| + dynamic_ntk + logn + window_attn | 4.23 | 3.81 | 3.52 | 3.33 | 3.22 | 3.17 |
| Qwen-14B | - | 3.46 | 22.79 | 334.65 | 3168.35 | - |
| + dynamic_ntk + logn + window_attn | - | 3.46 | 3.29 | 3.18 | 3.42 | - |

Reproduction

We have provided evaluation scripts to reproduce the performance of our model; see the link for details. Note that small fluctuations in the reproduced results are normal due to rounding errors caused by hardware and frameworks.

FAQ

If you run into problems, please consult the FAQ and existing issues to look for a solution before opening a new issue.

Citation

If you find our work helpful, feel free to cite it.

@article{qwen,
  title={Qwen Technical Report},
  author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
  journal={arXiv preprint arXiv:2309.16609},
  year={2023}
}

License Agreement

Our code and checkpoints are open for research purposes and are also allowed for commercial use. Check LICENSE for more details about the license. If you have requirements for commercial use, please fill out the form to apply.

่”็ณปๆˆ‘ไปฌ๏ผˆContact Us๏ผ‰

If you are interested in leaving a message to either our research team or product team, join our Discord or WeChat groups! Also, feel free to send an email to qianwen_opensource@alibabacloud.com.
