---
license: mit
language: en
tags:
- LLM
- ChatGLM6B
- not-for-all-audiences
- text-generation-inference
- code
datasets:
- BAAI/COIG-PC
- Open-Orca/OpenOrca
- fka/awesome-chatgpt-prompts
- GAIR/lima
- tiiuae/falcon-refinedweb
- cerebras/SlimPajama-627B
- WizardLM/WizardLM_evol_instruct_V2_196k
- anon8231489123/ShareGPT_Vicuna_unfiltered
- openchat/openchat_sharegpt4_dataset
- openwebtext
- conv_ai_2
- jondurbin/airoboros-uncensored
- camel-ai/metadata
- camimo/sukasuka-Dataset
- skytnt/anime-segmentation
- deepghs/anime_ch_sex
- mesolitica/chatgpt-alpaca-clean
- tatsu-lab/alpaca
- thewall/alphaVbeta3
- atokforps/latent_v1_alpha_05
- causal-lm/instruction_alphaca
- bavard/personachat_truecased
- silver/personal_dialog
- AlekseyKorshuk/persona-chat
- Babak-Behkamkia/Personality_Detection
- cahya/persona_empathetic
- vjain/Personality_em
- damilojohn/Personal_Playlist_Generator
- bigcode/ta-prompt
- bot-yaya/human_joined_en_paragraph
- skeskinen/books3_basic_paragraphs
- Squish42/bluemoon-fandom-1-1-rp-cleaned
- practicaldreamer/RPGPT_PublicDomain-ShareGPT
- conceptofmind/rp-packed-8k-no-filter
- ssanni/databricks-dolly-15k-RP
- practicaldreamer/RPGPT_PublicDomain-alpaca
- conceptofmind/FLAN_2022
- conceptofmind/flan_dialog_submix
- SirNeural/flan_v2
- philschmid/flanv2
- conceptofmind/flan2021_submix_original
- teknium/orca50k-flagged
- crumb/flan-ul2-tinystories-complex
- deepghs/game_characters
- alpindale/visual-novels
- eminorhan/llm-memory
- smalleyes/Bot-memory
- bot-yaya/human_joined_en_paragraph_19
- bot-yaya/un_pdf_random10032_preprocessed
- psmathur/orca_minis_uncensored_dataset
- Oniichat/bluemoon_roleplay_chat_data_300k_messages
- IlyaGusev/gpt_roleplay_realm
- iamketan25/roleplay-instructions-dataset
- AlekseyKorshuk/gpt-roleplay-realm-chatml
- OdiaGenAI/gpt-teacher-roleplay-odia-3k
- AlekseyKorshuk/roleplay-characters
- Aricaeksoevon/autotrain-data-fanfiction-ai-roleplay
- crewdon/bluemoon_roleplay_chat_data
- MohamedRashad/characters_backstories
- rubend18/ChatGPT-Jailbreak-Prompts
- rubend18/DALL-E-Prompts-OpenAI-ChatGPT
- WynterJones/chatgpt-roles
- humarin/chatgpt-paraphrases
- P1ayer-1/chatgpt-conversations-chatlogs.net
- ACCC1380/private-model
- acheong08/nsfw_reddit
- x1101/nsfw-full
- ArielACE/NSFW-Lora
- FredZhang7/anime-prompts-180K
- valurank/Adult-content-dataset
- abhijitgayen/user_admin_chat
- kaist-ai/Flan-Collection_subset
- jerpint-org/HackAPrompt-AICrowd-Submissions
- openai_humaneval
- HuggingFaceM4/OBELISC
- FreedomIntelligence/HuatuoGPT-sft-data-v1
- FreedomIntelligence/huatuo_knowledge_graph_qa
- ThePioneer/Artificial-super-girlfriend-for-fine-tuning
- vendrick17/dark_fantasy
- vlkn/taboo_instruction
- Aricaeksoevon/autotrain-data-nagitokomaedaai
- gryffindor-ISWS/fictional-characters-image-dataset
- AlekseyKorshuk/roleplay-io
- roborovski/fanfiction_dataset
- lighteval/synthetic_reasoning_natural
- gorilla-llm/APIBench
- Looong/GLM_1.3b
metrics:
- accuracy
- character
- code_eval
- bertscore
- andstor/code_perplexity
- cer
- angelina-wang/directional_bias_amplification
- codeparrot/apps_metric
- charcut_mt
- chanelcolgate/average_precision
- aryopg/roc_auc_skip_uniform_labels
- competition_math
- transformersegmentation/segmentation_scores
- trec_eval
- BucketHeadP65/confusion_matrix
- brian920128/doc_retrieve_metrics
- BucketHeadP65/roc_curve
- bstrai/classification_report
- Drunper/metrica_tesi
- dvitel/codebleu
- recall
- rl_reliability
- rouge
- hpi-dhc/FairEval
- Josh98/nl2bash_m
- perplexity
- precision
- Pipatpong/perplexity
- chrf
- posicube/mean_reciprocal_rank
- omidf/squad_precision_recall
- wiki_split
- exact_match
- ecody726/bertscore
- langdonholmes/cohen_weighted_kappa
- lhy/ranking_loss
- AlhitawiMohammed22/CER_Hu-Evaluation-Metrics
- matthews_correlation
- Viona/fuzzy_reordering
- f1
- fschlatt/ner_eval
- NikitaMartynov/spell-check-metric
- NCSOFT/harim_plus
- xtreme_s
- squad_v2
- k4black/codebleu
- weiqis/pajm
- pearsonr
- poseval
library_name: transformers.js
---
## Breaking News!

**We know what you want, and here you go!**

- Newly released lyraChatGLM model, suitable for Ampere (A100/A10) as well as Volta (V100) GPUs.
- lyraChatGLM has been further optimized, reaching **9000 tokens/s** on A100 and **3900 tokens/s** on V100, about **5.5x** faster than the up-to-date official version (2023/6/1).
- Memory usage has been optimized too: batch_size can now be set up to **256** on A100!
- INT8 weight-only PTQ is supported.

**Note that the code was fully updated as well; you need to use the new API (see `Uses` below).**

If you like our work and would consider joining us, feel free to drop a line to benbinwu@tencent.com.

P.S. We have recently received many inquiries about accelerating customized models. We currently **do not plan** to release the conversion tool, nor do we expect our current release to work with customized models.
## Model Card for lyraChatGLM

lyraChatGLM is currently the **fastest ChatGLM-6B** available. To the best of our knowledge, it is the **first accelerated version of ChatGLM-6B**.

The inference speed of lyraChatGLM has achieved **300x** acceleration over the early original version. We are still working hard to further improve the performance.

Among its main features are (updated on 2023-06-20):
- weights: original ChatGLM-6B weights released by THUDM.
- device: Nvidia GPUs with the Ampere or Volta architecture (A100, A10, V100...).
- batch_size: compiled with dynamic batch size; the maximum depends on the device.
- Both CUDA 11.X and 12.X are now supported.
- Model loading has been further optimized: load time drops from a few minutes to under 10 s in non-int8 mode, and to around 1 min in int8 mode!
|
## Speed
- original version (fixed batch infer): commit id 1d240ba

### Test on A100 40G
1. Maximum batch size and maximum speed for each version of the model.

|version|max_batch_size|max_speed|
|:-:|:-:|:-:|
|original|1|30 tokens/s|
|original (fixed batch infer)|192|1638.52 tokens/s|
|lyraChatGLM (current)|256|9082.60 tokens/s|

2. Speed at the same batch size.

|version|batch_size=1|batch_size=8|batch_size=64|batch_size=128|
|:-:|:-:|:-:|:-:|:-:|
|original|30 tokens/s| - | - | - |
|original (fixed batch infer)|34.48 tokens/s|356.29 tokens/s|1638.52 tokens/s|1338.45 tokens/s|
|lyraChatGLM (current)|110.05 tokens/s|843.60 tokens/s|4926.92 tokens/s|7235.04 tokens/s|

### Test on V100
1. Maximum batch size and maximum speed for each version of the model.

|version|max_batch_size|max_speed|
|:-:|:-:|:-:|
|original|1|17.83 tokens/s|
|original (fixed batch infer)|128|992.20 tokens/s|
|lyraChatGLM (current)|192|3958.39 tokens/s|

2. Speed at the same batch size.

|version|batch_size=1|batch_size=8|batch_size=64|batch_size=128|
|:-:|:-:|:-:|:-:|:-:|
|original|17.83 tokens/s| - | - | - |
|original (fixed batch infer)|17.83 tokens/s|228.95 tokens/s|889.7 tokens/s|922.20 tokens/s|
|lyraChatGLM (current)|59.33 tokens/s|514.15 tokens/s|2849.88 tokens/s|3958.39 tokens/s|
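
As a reference, a hypothetical benchmarking sketch of how numbers like those above could be reproduced with the API shown in `Uses` below. It assumes tokens/s = total generated tokens across the batch divided by wall-clock time, and that every sequence emits the full `output_length` tokens; both are our assumptions, not a description of the official benchmark script.

```python
import time

# Hypothetical throughput measurement: total generated tokens across the
# batch divided by wall-clock time, assuming each sequence generates the
# full output_length tokens when do_sample=False.
def benchmark(model, prompt, batch_size, output_length=150):
    prompts = [prompt] * batch_size
    start = time.time()
    model.generate(prompts, output_length=output_length, top_k=30, top_p=0.85,
                   temperature=0.35, repetition_penalty=1.2, do_sample=False)
    elapsed = time.time() - start
    return batch_size * output_length / elapsed  # tokens/s
```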

## Model Sources

- **Repository:** https://huggingface.co/THUDM/chatglm-6b
|
## Docker Environment Recommendation

- For CUDA 11.X: we recommend ```nvcr.io/nvidia/pytorch:22.12-py3```
- For CUDA 12.0: we recommend ```nvcr.io/nvidia/pytorch:23.02-py3```

```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v "$(pwd)":/lyraChatGLM nvcr.io/nvidia/pytorch:23.02-py3

# inside the container:
cd /lyraChatGLM
pip install -r requirements.txt
python demo.py
```

## Uses

```python
from lyraChatGLM import LyraChatGLM6B

model_path = "./models/1-gpu-fp16.h5"
tokenizer_path = "./models"
data_type = "fp16"
int8_mode = 0  # set to 1 for INT8 weight-only PTQ
max_output_length = 150
arch = "Ampere"  # "Ampere" or "Volta"
cuda_version = 12

model = LyraChatGLM6B(model_path, tokenizer_path, data_type, int8_mode, arch, cuda_version)
prompt = "列出3个不同的机器学习算法,并说明它们的适用范围."  # "List 3 different machine learning algorithms and explain where each applies."
test_batch_size = 256  # used in the batched example below

prompts = [prompt, ]

# If you want different outputs within the same batch, set do_sample to True.
output_texts = model.generate(prompts, output_length=max_output_length, top_k=30, top_p=0.85, temperature=0.35, repetition_penalty=1.2, do_sample=False)

print(output_texts)
```
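
As a follow-up, a minimal sketch of batched generation. The prompt replication is our own illustration, and we assume `generate` returns one output per prompt; `model`, `prompt`, `test_batch_size`, and `max_output_length` come from the snippet above.

```python
# Fill a batch by replicating the prompt; with do_sample=True each sequence
# in the batch can produce a different continuation.
prompts = [prompt] * test_batch_size
output_texts = model.generate(prompts, output_length=max_output_length, top_k=30,
                              top_p=0.85, temperature=0.35, repetition_penalty=1.2,
                              do_sample=True)
print(len(output_texts))  # assumed: one output per prompt in the batch
```
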
## Demo output

### Input
List 3 different machine learning algorithms and explain where each of them is applicable.

### Output
Below are three common machine learning algorithms and their applicable scopes:

1. Decision Tree: a decision tree is a naive Bayes model for classification and regression problems. It predicts outcomes by building a series of progressively splitting branches. It is suitable for cases with simple features, large amounts of data, and a dataset size within an acceptable range.

2. Random Forest: a random forest is an ensemble learning algorithm composed of multiple decision trees. Its strength is handling large-scale data and high-dimensional features. It is suitable for scenarios that require modeling many variables, such as medical diagnosis and financial risk assessment.

3. Support Vector Machine: a support vector machine is a supervised learning method, usually used for classification problems. It can handle high-dimensional data with high accuracy. It is suitable for classification or regression on high-dimensional data, such as image recognition and natural language processing.
|
## INT8

**INT8 usage**:

Our current version supports INT8 weight-only PTQ. To enable this mode, simply set `int8_mode` to `1` in demo.py.
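
For example, a minimal sketch, identical to the `Uses` example except for `int8_mode` (the paths are the same illustrative ones as above):

```python
from lyraChatGLM import LyraChatGLM6B

# Same arguments as in `Uses`, but int8_mode=1 enables INT8 weight-only PTQ.
# The same fp16 checkpoint is loaded; int8 mode takes around 1 minute to load.
model = LyraChatGLM6B("./models/1-gpu-fp16.h5", "./models", "fp16",
                      1,  # int8_mode
                      "Ampere", 12)
```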

**In this mode, GPU memory usage can be further reduced by about half and the speed can be doubled.**

This solves the issue mentioned in https://github.com/THUDM/ChatGLM-6B/issues/1042.

However, the speed gain is best achieved with a batch size of no more than 128. If you are not using an A100 GPU, reduce the batch size accordingly and you will still get the benefit; we recommend a batch size of 64. This mode is very suitable for GPUs with limited VRAM, or for real-time services where large batch sizes are impractical.

Note that although we have aligned the accuracy in our test cases, there may be slight accuracy differences in some untested scenarios with int8. Please be aware of this.
|
## Citation
```bibtex
@Misc{lyraChatGLM2023,
  author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraChatGLM: Accelerating ChatGLM to 9000+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
  year =         {2023}
}
```

## Report bug
- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraChatGLM/discussions
- Mark the title with a `[bug]` tag when reporting a bug.