Upload folder using huggingface_hub

- README.md +36 -99
- config.json +26 -33
- special_tokens_map.json +23 -0
- tokenizer.json +0 -0
- tokenizer.model +2 -2
- tokenizer_config.json +33 -31
README.md
CHANGED

@@ -1,118 +1,55 @@
 ---
-- sentence-similarity
-- zero-shot-classification
-- named-entity-recognition
-- translation
-- nli
-- question-answering
-- text-error-correction
-- text-summarization
-- faq-question-answering
-- text-classification
-model-type:
-- gpt
-domain:
-- nlp
-frameworks:
-- pytorch
-backbone:
-- transformer
-tags:
-- transformer
-studios:
-- damo/demo-polylm-multialpaca-13b
 ---
-
-##
-PolyLM is a polyglot large language model covering 18 languages, including Chinese, English, Spanish, French, German, Russian, Portuguese, Italian, Arabic, Japanese, Korean, Thai, Vietnamese, and Indonesian. It can be applied to dialogue and question answering, text generation, machine translation, sentiment analysis, and related tasks, automatically generating high-quality multilingual text and thereby facilitating cross-lingual and cross-cultural communication.
-
-Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English.
-
-This project provides a series of models of different sizes and purposes; parameter scales include 1.7B and 13B versions (the current model is the 13B version), covering both the pretrained base model and the instruction-tuned Chat versions (the MultiAlpaca series). All versions are listed below:
-
-|[PolyLM-13B](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation/summary)|bfloat16|40|40|5120|2048|6.0e-5|4M|Pretrain Model|
-|[PolyLM-MultiAlpaca-13B](https://modelscope.cn/models/damo/nlp_polylm_multialpaca_13b_text_generation/summary)|bfloat16|40|40|5120|2048|6.0e-5|4M|Chat Model|
-|[PolyLM-Assistant-13B](https://www.modelscope.cn/models/damo/nlp_polylm_assistant_13b_text_generation/summary)|bfloat16|40|40|5120|2048|6.0e-5|4M|Chat Model|
-
-## Training Data
-
-This model takes the [PolyLM-13B](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation/summary) pretrained model as its base and was instruction-tuned on the following data:
-
-|Name|Size|Construction|Notes|
-|---|---|---|---|
-|[code_alpaca](https://github.com/sahil280114/codealpaca)|28|GPT-3.5 self-instruct|For correct display, the code was format-filtered: at least one of the input or output must contain a matched pair of ```|
-|[dolly](https://github.com/databrickslabs/dolly)|15,011|Human-written||
-|[flan_v2](https://huggingface.co/datasets/SirNeural/flan_v2)|100,000|Various NLP and CoT tasks|Sampled from flan_v2; the full dataset is very large|
-|[gpt4_alpaca](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) (English)|52,002|GPT-4 self-instruct||
-|[gpt4_alpaca](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) (Chinese)|48,818|GPT-4 self-instruct||
-|[multilingual_alpaca](https://www.modelscope.cn/datasets/damo/nlp_polylm_multialpaca_sft/summary)|132,701|GPT-3.5 self-instruct||
-|[open_assistant](https://github.com/LAION-AI/Open-Assistant)|55,668|Human-written||
-|[share_gpt](https://github.com/domeccleston/sharegpt)|140,591|ChatGPT chat logs||
-|[gpteacher_codegen](https://huggingface.co/GeorgiaTechResearchInstitute/starcoder-gpteacher-code-instruct)|4,535|||
-
-## Model Download
-```bash
-git lfs install
-git clone https://www.modelscope.cn/damo/nlp_polylm_assistant_13b_text_generation.git
-```
-
-## Model Usage
-```shell
-pip install huggingface-hub==0.25.* transformers==4.48.3
-```
-
-# git clone https://github.com/modelscope/modelscope
-# cd modelscope
-# pip install .
-
-from modelscope import snapshot_download
-
-input_text = f"Beijing is the capital of China.\nTranslate this sentence from English to Chinese."
-input_text = "<|user|>\n" + f"{input_text}\n" + "<|assistant|>\n"
-
-```
-@misc{wei2023polylm,
-      title={PolyLM: An Open Source Polyglot Large Language Model},
-      author={Xiangpeng Wei and Haoran Wei and Huan Lin and Tianhao Li and Pei Zhang and Xingzhang Ren and Mei Li and Yu Wan and Zhiwei Cao and Binbin Xie and Tianxiang Hu and Shangjie Li and Binyuan Hui and Bowen Yu and Dayiheng Liu and Baosong Yang and Fei Huang and Jun Xie},
-      year={2023},
-      eprint={2307.06018},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL}
-}
-```
 ---
+license: mit
+datasets:
+- HuggingFaceFW/fineweb
+pipeline_tag: text-generation
 ---
+# Tiny-LLM
+
+A tiny LLM with just 10 million parameters; it is probably one of the smallest LLMs around, and it is functional.
+
+## Pretraining
+
+Tiny-LLM was trained on 32B tokens of the Fineweb dataset, with a context length of 1024 tokens.
+
+## Getting Started
+
+To start using these models, you can simply load them via the Hugging Face `transformers` library:
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+MODEL_NAME = "arnir0/Tiny-LLM"
+
+tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
+
+def generate_text(prompt, model, tokenizer, max_length=512, temperature=1, top_k=50, top_p=0.95):
+    inputs = tokenizer.encode(prompt, return_tensors="pt")
+
+    outputs = model.generate(
+        inputs,
+        max_length=max_length,
+        temperature=temperature,
+        top_k=top_k,
+        top_p=top_p,
+        do_sample=True
+    )
+
+    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    return generated_text
+
+def main():
+    # Define your prompt
+    prompt = "According to all known laws of aviation, there is no way a bee should be able to fly."
+
+    generated_text = generate_text(prompt, model, tokenizer)
+
+    print(generated_text)
+
+if __name__ == "__main__":
+    main()
+```
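The removed PolyLM usage snippet above wraps raw user input in `<|user|>` / `<|assistant|>` markers before generation. That formatting step can be sketched as a small helper; the function name `build_chat_prompt` is hypothetical, but the marker strings are taken verbatim from the removed snippet:

```python
def build_chat_prompt(user_text: str) -> str:
    # Wrap raw user text in the chat markers used by the removed
    # PolyLM-Assistant usage snippet; the model is expected to continue
    # the text after the trailing "<|assistant|>" marker.
    return "<|user|>\n" + f"{user_text}\n" + "<|assistant|>\n"

prompt = build_chat_prompt(
    "Beijing is the capital of China.\nTranslate this sentence from English to Chinese."
)
print(prompt)
```

The trailing newline after `<|assistant|>` matters: it matches the removed snippet exactly, and chat-tuned models are sensitive to the precise prompt template they were tuned on.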
config.json
CHANGED

@@ -1,35 +1,28 @@
 {
-  "summary_proj_to_labels": true,
-  "summary_type": "cls_index",
-  "summary_use_proj": true,
-  "tokenizer_class": "AutoTokenizer",
-  "transformers_version": "4.29.2",
-  "use_cache": true,
-  "vocab_size": 256000
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "auto_map": {
+    "AutoTokenizer": "modeling_minicpmv.MiniCPMV",
+    "AutoModelForCausalLM": "modeling_minicpmv.MiniCPMV"
+  },
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "hidden_act": "silu",
+  "hidden_size": 192,
+  "initializer_range": 0.02,
+  "intermediate_size": 1024,
+  "max_position_embeddings": 1024,
+  "model_type": "llama",
+  "num_attention_heads": 2,
+  "num_hidden_layers": 1,
+  "num_key_value_heads": 1,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": null,
+  "tie_word_embeddings": false,
+  "torch_dtype": "float32",
+  "transformers_version": "4.31.0.dev0",
+  "use_cache": true,
+  "vocab_size": 32000
 }
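The new config describes a one-layer Llama model, so the parameter count can be derived analytically from its fields under the standard bias-free Llama parameterization. This is a sanity-check estimate computed from the config alone (it does not load the checkpoint), not an official count:

```python
# Fields copied from the new config.json above.
cfg = {
    "vocab_size": 32000,
    "hidden_size": 192,
    "intermediate_size": 1024,
    "num_hidden_layers": 1,
    "num_attention_heads": 2,
    "num_key_value_heads": 1,
    "tie_word_embeddings": False,
}

def llama_param_count(c: dict) -> int:
    h = c["hidden_size"]
    head_dim = h // c["num_attention_heads"]
    kv_dim = c["num_key_value_heads"] * head_dim
    # Attention: q/o projections are h x h; k/v are h x kv_dim (GQA, no biases).
    attn = 2 * h * h + 2 * h * kv_dim
    # SwiGLU MLP: gate and up are h x intermediate, down is intermediate x h.
    mlp = 3 * h * c["intermediate_size"]
    # Two RMSNorm weight vectors per decoder layer.
    per_layer = attn + mlp + 2 * h
    # Token embedding, an untied LM head, and the final RMSNorm.
    embeds = c["vocab_size"] * h * (1 if c["tie_word_embeddings"] else 2)
    return c["num_hidden_layers"] * per_layer + embeds + h

print(llama_param_count(cfg))  # prints 12988992
```

At roughly 13M parameters, almost all of which sit in the untied embedding and LM-head matrices, the result is in the same ballpark as the model card's "10 Million" claim.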
special_tokens_map.json
ADDED

@@ -0,0 +1,23 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
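The added map stores each special token as a structured entry whose `content` field holds the literal token string. A minimal sketch of extracting those strings (the JSON is inlined here so the snippet is self-contained; in practice you would read the `special_tokens_map.json` file from the repository):

```python
import json

# Inline copy of the added special_tokens_map.json content.
raw = """
{
  "bos_token": {"content": "<s>", "lstrip": false, "normalized": true,
                "rstrip": false, "single_word": false},
  "eos_token": {"content": "</s>", "lstrip": false, "normalized": true,
                "rstrip": false, "single_word": false},
  "unk_token": {"content": "<unk>", "lstrip": false, "normalized": true,
                "rstrip": false, "single_word": false}
}
"""

# Reduce each structured entry to its literal token string.
tokens = {name: spec["content"] for name, spec in json.loads(raw).items()}
print(tokens)  # {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
```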
tokenizer.json
ADDED

(The diff for this file is too large to render; see the raw diff.)
tokenizer.model
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
+size 499723
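`tokenizer.model` is stored as a Git LFS pointer, so the diff above changes only the `oid` and `size` lines, not the binary blob itself. A pointer file is a short key/value text document, which can be parsed with a few lines; the values below come from the new pointer:

```python
def parse_lfs_pointer(text: str) -> dict:
    # Each line of a Git LFS pointer is "<key> <value>", e.g.
    # "size 499723"; split on the first space only.
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347\n"
    "size 499723\n"
)
info = parse_lfs_pointer(pointer)
print(info["size"])  # prints "499723"
```

This is why `git lfs install` must run before cloning such repositories: without the LFS filter, a checkout yields these small pointer files instead of the real model weights.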
tokenizer_config.json
CHANGED

@@ -1,40 +1,42 @@
 {
-  "
-  "AutoTokenizer": [
-    "modeling_minicpmv.MiniCPMV",
-    "modeling_minicpmv.MiniCPMV"
-  ],
-  "AutoModelForCausalLM": "modeling_minicpmv.MiniCPMV"
-  },
-  "add_bos_token": false,
+  "add_bos_token": true,
   "add_eos_token": false,
-  "
-  "
-  },
+  "add_prefix_space": true,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
   "clean_up_tokenization_spaces": false,
-  "eos_token":
-    "content": "</s>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  },
+  "eos_token": "</s>",
+  "legacy": true,
   "model_max_length": 2048,
   "pad_token": null,
   "sp_model_kwargs": {},
+  "spaces_between_special_tokens": false,
   "tokenizer_class": "LlamaTokenizer",
-  "unk_token":
-    "content": "<unk>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  }
+  "unk_token": "<unk>",
+  "use_default_system_prompt": false
 }
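The main behavioral change in the new tokenizer_config is `add_bos_token` flipping from false to true: every encoded sequence now starts with `<s>` (id 1 per `added_tokens_decoder`). A toy sketch of that logic, mirroring the config flags but not the real `LlamaTokenizer` implementation:

```python
BOS_ID = 1  # "<s>" per added_tokens_decoder in the new tokenizer_config
EOS_ID = 2  # "</s>"

def encode_ids(ids, add_bos_token=True, add_eos_token=False):
    # Mimic the tokenizer_config flags: optionally prepend BOS and
    # append EOS around already-tokenized ids.
    out = ([BOS_ID] if add_bos_token else []) + list(ids)
    if add_eos_token:
        out.append(EOS_ID)
    return out

print(encode_ids([5, 6, 7]))                       # [1, 5, 6, 7]  (new config behavior)
print(encode_ids([5, 6, 7], add_bos_token=False))  # [5, 6, 7]     (old config behavior)
```

This change matters for reproducibility: logits for the same text differ depending on whether the BOS token was prepended, so inference code should rely on the config rather than adding `<s>` manually.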