neukg/TechGPT-7B

@@ -9,49 +9,36 @@ language:
 ---
 # TechGPT
 Demo: [TechGPT-neukg](http://techgpt.neukg.com)
 Github: [GitHub - neukg/TechGPT](https://github.com/neukg/TechGPT)
 ## 简介 Introduction
 TechGPT主要强化了如下三类任务:
 - 以“知识图谱构建”为核心的关系三元组抽取等各类信息抽取任务
 - 以“阅读理解”为核心的各类智能问答任务。
 - 以“文本理解”为核心的关键词生成等各类序列生成任务。
 在这三大自然语言处理核心能力之内，TechGPT还具备了对计算机科学、材料、机械、冶金、金融和航空航天等十余种垂直专业领域自然语言文本的处理能力。
 目前，TechGPT通过提示和指令输入方式的不同，支持单轮对话和多轮对话，涵盖了领域术语抽取、命名实体识别、关系三元组抽取、文本关键词生成、标题生成摘要、摘要生成标题、文本领域识别、机器阅读理解、基础常识问答、基于上下文的知识问答、建议咨询类问答、文案生成、中英互译和简单代码生成等多项自然语言理解和生成能力。
 ---
 TechGPT mainly strengthens three types of tasks:
 - Information extraction tasks centered on "knowledge graph construction", such as relation triplet extraction.
 - Intelligent question-answering tasks centered on "reading comprehension".
 - Sequence generation tasks centered on "text understanding", such as keyword generation.
 Within these three core natural language processing capabilities, TechGPT can also process natural language text from more than ten specialized vertical domains, including computer science, materials, machinery, metallurgy, finance, and aerospace.
 TechGPT currently supports single-turn and multi-turn dialogue, depending on how prompts and instructions are supplied. Its natural language understanding and generation capabilities cover domain term extraction, named entity recognition, relation triplet extraction, keyword generation, abstract generation from titles, title generation from abstracts, text domain recognition, machine reading comprehension, basic commonsense question answering, context-based knowledge question answering, advisory question answering, copywriting, Chinese-English translation, and simple code generation.
 ## 模型功能 Model
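 In addition to retaining basic dialogue and translation abilities, the model was strengthened mainly on data derived from **TechKG**. As a result, **TechGPT** is better at tasks in specialized domains such as computer science, including entity and entity-type extraction, relation triplet extraction, title expansion, abstract condensation, keyword generation, and machine reading comprehension.
 The examples below show several of TechGPT's core capabilities.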
 1. **Named entity recognition**
 <table>   <colgroup>     <col style="width:60%">     <col style="width:40%">   </colgroup>   <tr>     <td>Question</td>     <td>Response</td> </tr>
   <tr>     <td>抽取出下面文本的实体和实体类型：<br/>《女人树》，国产电视剧，由导演田迪执导，根据作家子页的原著改编，故事从1947年开始，跨越了解放战争和建国初期两大历史时期，展现了战斗在隐形战线上的人民英雄是如何不惧怕任何危险，不计较个人牺牲，甚至不顾人民内部的误解和生死裁决，都不暴露个人真实身份，至死不渝，与敌人周旋到底的英雄故事。</td>     <td>实体名: 女人树， 实体类型：影视作品<br/>实体名: 田迪， 实体类型：人物<br/>实体名: 子页， 实体类型：人物<br/>实体名: 1947年， 实体类型：Date</td> </tr>
 </table>
 2. **Relation triplet extraction**
 <table>   <colgroup>     <col style="width:60%">     <col style="width:40%">   </colgroup>   <tr>     <td>Question</td>     <td>Response</td> </tr>
   <tr>     <td>
 请抽取出下面的关系三元组：<br>陈奕迅（Eason Chan），1974年7月27日出生于香港，中国香港流行乐男歌手、演员，毕业于英国金斯顿大学。     1995年因获得第14届新秀歌唱大赛冠军而正式出道。2005年发行的歌曲《浮夸》成为其歌唱生涯的代表作品之一。</td>     <td><陈奕迅, 出生日期, 1974年><br><陈奕迅, 出生地, 香港><br>
@@ -60,41 +47,32 @@ TechGPT supports single-round dialogue and multi-round dialogue through differen
 </table>
 3. **Title expansion (title to abstract)**
 <table>   <colgroup>     <col style="width:60%">     <col style="width:40%">   </colgroup>   <tr>     <td>Question</td>     <td>Response</td> </tr>
   <tr>     <td>请把下列标题扩写成摘要, 不少于100字: 基于视觉语言多模态的实体关系联合抽取的研究。</td>     <td>本文介绍了一种基于视觉语言的多模态实体关系联合抽取出方法。该方法利用了图像和文本之间的语义联系，通过将图像中的物体与相应的文本描述进行匹配来识别实体之间的关系。同时，本文还提出了一种新的模型结构——深度双向编码器-解码器网络（BiDAN），用于实现这种联合提取任务。实验结果表明，所提出的方法在多个数据集上取得了较好的性能表现，证明了其有效性和实用性。</td> </tr>
 </table>
 4. **Abstract condensation (abstract to title)**
 <table>   <colgroup>     <col style="width:60%">     <col style="width:40%">   </colgroup>   <tr>     <td>Question</td>     <td>Response</td> </tr>
   <tr>     <td>请把下列摘要缩写成标题:本文介绍了一种基于视觉语言的多模态实体关系联合抽取出方法。该方法利用了图像和文本之间的语义联系，通过将图像中的物体与相应的文本描述进行匹配来识别实体之间的关系。同时，本文还提出了一种新的模型结构——深度双向编码器-解码器网络（BiDAF），用于实现这种联合提取任务。实验结果表明，所提出的方法在多个数据集上取得了较好的性能表现，证明了其有效性和实用性。</td>     <td>基于视觉语言的多模态实体关系联合抽取研究。</td> </tr>
 </table>
 5. **Keyword generation**
 <table>   <colgroup>     <col style="width:60%">     <col style="width:40%">   </colgroup>   <tr>     <td>Question</td>     <td>Response</td> </tr>
   <tr>     <td>请提取下面文本中的关键词。<br>本体是一种重要的知识库,其包含的丰富的语义信息可以为问答系统、信息检索、语义Web、信息抽取等领域的研究及相关应用提供重要的支持.因而,如何快速有效地构建本体具有非常重要的研究价值.研究者们分别从不同角度提出了大量有效地进行本体构建的方法.一般来讲,这些本体构建方法可以分为手工构建的方法和采用自动、半自动技术构建的方法.手工本体的方法往往需要本体专家参与到构建的整个过程,存在着构建成本高、效率低下、主观性强、移植不便等缺点,因而,此类方法正逐步被大量基于自动、半自动技术的本体构建方法所代替.自动、半自动构建的方法不需要（或仅需少量）人工参与,可以很方便地使用其它研究领域（如机器学习、自然语言处理等）的最新研究成果,也可以方便地使用不同数据源进行本体构建.</td>     <td>本体，知识库，问答系统，信息检索，语义 Web，信息抽取，统计主导，语言分析主导，术语抽取，概念抽取，关系抽取，本体形成，评价方法，本体构建系统</td> </tr>
 </table>
 6. **Machine reading comprehension**
 <table>   <colgroup>     <col style="width:60%">     <col style="width:40%">   </colgroup>   <tr>     <td>Question</td>     <td>Response</td> </tr>
   <tr>     <td>请问这起交通事故是谁的责任居多?<br>小车和摩托车发生事故，在无红绿灯的十字路口，小停车看看左右，在觉得安全的情况下刹车慢慢以时速10公里左右的速度靠右行驶过路口，好没有出到十字路口正中时，被左边突然快速行驶过来的摩托车撞在车头前，  摩托车主摔到膝盖和檫伤脸部，请问这起交通事故是谁的责任居多。如果双方都有责任的话，大概各占几成？~\r",   "answer": "通过没有信号控制的十字路口，应该减速慢性，让右边的车先行，按你说的，摩托车好像在汽车的左边，所以严格来说可能摩托车全责。当然还要看汽车是否证照齐全，是否饮酒等。具体由交警调查后认定。</td>     <td>根据您提供的信息，我们可以初步判断该交通事故中，摩托车司机可能承担主要责任。因为根据交通规则，在没有信号控制的地方，车辆应当按照右侧先行的原则行驶，而摩托车驾驶员似乎没有遵守这一规定，导致与汽车相撞。但是需要注意的是，具体的责任划分还需要依据现场勘查、证人证言以及相关法律法规来确定。建议您在遇到类似情况时，及时报警并等待交警处理，不要自行决定赔偿或离开现场。</td> </tr>
 </table>
 ## 准备工作 Preparation
 由于LLaMA开源协议的限制，本模型仅限于研究和学习用途使用。请严格遵守LLaMA的使用要求和规范。为了确保这一点，我们需要确认您具有LLaMA的原始权重，并来自完全合法渠道。
 ---
 Because of the restrictions in the LLaMA license, this model may be used only for research and learning purposes. Please strictly follow LLaMA's usage requirements and specifications. To ensure this, we need to confirm that you hold the original LLaMA weights and that you obtained them through a fully legitimate channel.
 1. 你需要先下载模型到本地，并校验它们的检查和：
 ```
 md5sum ./*
 6b2b545ff7bacaeec6297198b4b745dd  ./config.json.e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.enc
@@ -106,7 +84,6 @@ e765a7740a908b5e166e95b6ee09b94b  ./tokenizer.model.e3b0c44298fc1c149afbf4c8996f
 ```
 2. 根据这里→的[指定脚本](https://github.com/neukg/TechGPT/blob/main/utils/decrypt.py)解码模型权重：
 ```shell
 for file in $(ls /path/encrypt_weight); do
   python decrypt.py --type decrypt \
@@ -117,9 +94,7 @@ done
 ```
 请将 `/path/encrypt_weight`替换为你下载的加密文件目录，把`/path/to_original_llama_7B`替换为你已有的合法LLaMA-7B权重目录，里面应该有原LLaMA权重文件`consolidated.00.pth`，将 `/path/to_finetuned_model` 替换为你要存放解码后文件的目录。
 在解码完成后，应该可以得到以下文件：
 ```shell
 ./config.json
 ./generation_config.json
@@ -130,7 +105,6 @@ done
 ```
 3. 请检查所有文件的检查和是否和下面给出的相同, 以保证解码出正确的文件：
 ```
 md5sum ./*
 6d5f0d60a6e36ebc1518624a46f5a717  ./config.json
@@ -140,11 +114,8 @@ md5sum ./*
 08f6f621dba90b2a23c6f9f7af974621  ./tokenizer_config.json
 6ffe559392973a92ea28032add2a8494  ./tokenizer.model
 ```
 ---
 1. Download this model to a local directory (for example, with `git clone`), then verify the checksums of the encrypted files:
 ```
 md5sum ./*
 6b2b545ff7bacaeec6297198b4b745dd  ./config.json.e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.enc
@@ -154,14 +125,11 @@ md5sum ./*
 953dceae026a7aa88e062787c61ed9b0  ./tokenizer_config.json.e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.enc
 e765a7740a908b5e166e95b6ee09b94b  ./tokenizer.model.e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.enc
 ```
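The comparison in step 1 can also be automated instead of being checked by eye. The sketch below is a minimal illustration, assuming GNU coreutils `md5sum` is available; the manifest name `expected.md5` and the file `demo.enc` are hypothetical stand-ins for the published checksum list and the downloaded files.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so the demo touches no real files.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical stand-in for one downloaded encrypted file.
printf 'demo' > demo.enc

# Build a manifest in md5sum's own "<hash>  <path>" format.
# For the real model, save the published checksum lines to this file instead.
md5sum ./demo.enc > expected.md5

# --check re-hashes every listed file and compares it with the manifest;
# the exit status is non-zero if any file is missing or differs.
if md5sum --check --quiet expected.md5; then
  result="all files match"
else
  result="checksum mismatch"
fi
echo "$result"
```

For the real files, run the same `md5sum --check` command in the download directory against a manifest that contains the checksum lines listed above.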
 2. Decrypt the files using the script at https://github.com/neukg/TechGPT/blob/main/utils/decrypt.py.
 You can run the following command in Bash.
 Replace `/path/encrypt_weight` with the directory that holds the downloaded encrypted files,
 replace `/path/to_original_llama_7B` with the directory that holds your original LLaMA-7B weights, which should contain the file `consolidated.00.pth`,
 and replace `/path/to_finetuned_model` with the directory where the decrypted files should be saved.
 ```bash
 for file in $(ls /path/encrypt_weight); do
   python decrypt.py --type decrypt \
@@ -170,9 +138,7 @@ for file in $(ls /path/encrypt_weight); do
     --key_file /path/to_original_llama_7B/consolidated.00.pth
 done
 ```
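Each encrypted file name in step 1 appears to be the decrypted name followed by a hash-like suffix and `.enc`. The sketch below is an illustration only and is not part of `decrypt.py`: it shows how such names could be mapped back with Bash parameter expansion, iterating with a glob rather than parsing `ls` output. The helper `strip_enc_suffix` and the demo file names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: drop a trailing ".<suffix>.enc" from a file name.
strip_enc_suffix() {
  local name="${1##*/}"   # keep only the base name
  name="${name%.enc}"     # remove the ".enc" extension
  echo "${name%.*}"       # remove the last dot-separated suffix
}

# An encrypted file name taken from the checksum list above.
encrypted="./tokenizer.model.e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.enc"
decrypted_name="$(strip_enc_suffix "$encrypted")"
echo "$decrypted_name"

# Demo directory with made-up encrypted file names.
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT
touch "$demo_dir/config.json.0123abcd.enc" "$demo_dir/tokenizer.model.0123abcd.enc"

# A glob keeps names that contain spaces intact, unlike looping over `ls` output.
mapped=""
for file in "$demo_dir"/*.enc; do
  mapped+="$(strip_enc_suffix "$file") "
done
echo "$mapped"
```

The same glob pattern could replace `$(ls /path/encrypt_weight)` in the loop above, in which case `$file` would hold a full path instead of a bare file name.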
 After running the command above, you should obtain the following files:
 ```
 ./config.json
 ./generation_config.json
@@ -181,12 +147,9 @@ After executing the aforementioned command, you will obtain the following files.
 ./tokenizer_config.json
 ./tokenizer.model
 ```
 3. Check the MD5 checksums.
 Verify the MD5 checksum of every decrypted file to confirm that it was recovered correctly.
 The expected checksums are:
 ```
 md5sum ./*
 6d5f0d60a6e36ebc1518624a46f5a717  ./config.json
@@ -199,7 +162,6 @@ md5sum ./*
 ## 使用方法 Model Usage
 请注意在**训练**和**推理**阶段, 模型接收的输入格式是一致的：
 Note that the model expects the same input format during both **training** and **inference**:
