---
license: creativeml-openrail-m
language:
- en
tags:
- LLM
- tensorRT
- Belle
---
## Model Card for lyraBelle

lyraBelle is currently the **fastest Belle model** available. To the best of our knowledge, it is the **first accelerated version of BELLE-7B-2M**.

The inference speed of lyraBelle is roughly **10x** that of the original model, and we are still working to improve the performance further.

Among its main features are:

- weights: original BELLE-7B-2M weights released by BelleGroup.
- device: Any
- batch_size: compiled with dynamic batch size, max batch_size = 8

## Speed

### Test environment

- device: NVIDIA A100 40G
- batch size: 8

|version|speed|
|:-:|:-:|
|original|30 tokens/s|
|lyraBelle|310 tokens/s|

## Model Sources

- **Repository:** [https://huggingface.co/BelleGroup/BELLE-7B-2M](https://huggingface.co/BelleGroup/BELLE-7B-2M)

## Try the demo in 2 fast steps

```bash
# Step 1: fetch the model repository
git clone https://huggingface.co/TMElyralab/lyraChatGLM
cd lyraChatGLM

# Step 2: run demo.py inside the prebuilt environment image
docker run --gpus=1 --rm --net=host -v ${PWD}:/workdir yibolu96/lyra-chatglm-env:0.0.1 python3 /workdir/demo.py
```
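
The container mounts the current checkout at `/workdir` and runs the bundled `demo.py` on a single GPU, so nothing is required locally beyond Docker with NVIDIA GPU support.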

## Uses

```python
from transformers import AutoTokenizer
from faster_chat_glm import GLM6B, FasterChatGLM

MAX_OUT_LEN = 100

# Tokenize the prompt ("Why do we need to accelerate deep learning models?").
tokenizer = AutoTokenizer.from_pretrained('./models', trust_remote_code=True)
input_str = ["为什么我们需要对深度学习模型加速?", ]
inputs = tokenizer(input_str, return_tensors="pt", padding=True)
input_ids = inputs.input_ids.to('cuda:0')

# Build the decoding kernel from the compiled plan (dynamic batch size, max 8).
plan_path = './models/glm6b-bs8.ftm'
kernel = GLM6B(plan_path=plan_path,
               batch_size=1,
               num_beams=1,
               use_cache=True,
               num_heads=32,
               emb_size_per_heads=128,
               decoder_layers=28,
               vocab_size=150528,
               max_seq_len=MAX_OUT_LEN)
chat = FasterChatGLM(model_dir="./models", kernel=kernel).half().cuda()

# Generate, then de-tokenize the model output back to text.
sample_output = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)
res = tokenizer.decode(sample_output[0], skip_special_tokens=True)
print(res)
```
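
Because the plan file is compiled with a dynamic batch size (max 8), several prompts can be decoded in one pass. Below is a minimal sketch under that assumption, reusing `tokenizer`, `plan_path`, and `MAX_OUT_LEN` from above; the second prompt is hypothetical.

```python
# Batched inference sketch: len(prompts) must stay within the compiled maximum of 8.
prompts = ["为什么我们需要对深度学习模型加速?",  # "Why do we need to accelerate deep learning models?"
           "什么是TensorRT?"]  # hypothetical second prompt: "What is TensorRT?"
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
input_ids = inputs.input_ids.to('cuda:0')

kernel = GLM6B(plan_path=plan_path,
               batch_size=len(prompts),  # dynamic batch size, up to 8
               num_beams=1,
               use_cache=True,
               num_heads=32,
               emb_size_per_heads=128,
               decoder_layers=28,
               vocab_size=150528,
               max_seq_len=MAX_OUT_LEN)
chat = FasterChatGLM(model_dir="./models", kernel=kernel).half().cuda()

outputs = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)
for seq in outputs:
    # Decode each sequence in the batch separately.
    print(tokenizer.decode(seq, skip_special_tokens=True))
```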

## Demo output

### Input
为什么我们需要对深度学习模型加速?(Why do we need to accelerate deep learning models?)

### Output
为什么我们需要对深度学习模型加速? 深度学习模型的训练需要大量计算资源,特别是在训练模型时,需要大量的内存、GPU(图形处理器)和其他计算资源。因此,训练深度学习模型需要一定的时间,并且如果模型不能快速训练,则可能会导致训练进度缓慢或无法训练。

以下是一些原因我们需要对深度学习模型加速:

1. 训练深度神经网络需要大量的计算资源,特别是在训练深度神经网络时,需要更多的计算资源,因此需要更快的训练速度。

(Translation: "Why do we need to accelerate deep learning models? Training a deep learning model consumes substantial compute; in particular, it needs large amounts of memory, GPUs (graphics processing units), and other resources. Training therefore takes time, and a model that cannot be trained quickly may make slow progress or fail to train at all. Here are some reasons we need to accelerate deep learning models: 1. Training deep neural networks demands large amounts of compute, so faster training speed is needed.")

### TODO

We plan to publish a much faster FasterTransformer-based release. Stay tuned!

## Citation

```bibtex
@Misc{lyraChatGLM2023,
  author =       {Kangjian Wu and Zhengtao Wang and Bin Wu},
  title =        {lyraChatGLM: Accelerating ChatGLM by 10x+},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
  year =         {2023}
}
```

## Report bug
- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraChatGLM/discussions
- Mark the title with a `[bug]` tag.