Commit 7313ccb · Parent: 6f38d12 · Upload README.md
<h1 align="center">
Long Bert Chinese
<br>
</h1>

<h4 align="center">
<p>
<b>简体中文</b> |
<a href="https://github.com/OctopusMind/long-bert-chinese/blob/main/README_EN.md">English</a>
</p>
</h4>

<p>
<br>
</p>

**Long Bert**: a long-text similarity model supporting sequences of up to 8192 tokens.
It is based on bert-base-chinese, with the original BERT position embeddings replaced by ALiBi position encoding, so that BERT can handle a sequence length of 8192.
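The ALiBi substitution can be pictured as adding a distance-proportional penalty to every attention logit instead of learned position embeddings. Below is a minimal numpy sketch of the bias tensor (an illustrative reconstruction, not this repository's code; a symmetric `|i - j|` distance is assumed here because BERT attends bidirectionally, whereas the original ALiBi formulation is causal):

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    # Head-specific slopes: the geometric sequence 2^(-8/n), 2^(-16/n), ...
    # used by ALiBi when n_heads is a power of two.
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    # Relative distance between query position i and key position j.
    pos = np.arange(seq_len)
    dist = np.abs(pos[None, :] - pos[:, None])          # (seq_len, seq_len)
    # Bias added to the attention logits: -slope * distance, per head.
    return -slopes[:, None, None] * dist[None, :, :]    # (n_heads, L, L)
```

Because the bias is computed from positions on the fly, the same weights extrapolate to sequence lengths the model was never trained on, which is what enables the 8192-token limit.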
### News

* `CoSENT` fine-tuning is supported
* The model has been uploaded to [Huggingface](https://huggingface.co/OctopusMind/LongBert)
### Usage

```python
from numpy.linalg import norm
from transformers import AutoModel

model_path = "OctopusMind/longbert-8k-zh"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

sentences = ['我是问蚂蚁借呗为什么不能提前结清欠款', "为什么借呗不能选择提前还款"]
embeddings = model.encode(sentences)

# Cosine similarity between the two sentence embeddings
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
print(cos_sim(embeddings[0], embeddings[1]))
```
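For more than two sentences, the same cosine computation extends to a full pairwise similarity matrix with a single matrix product. A small numpy-only sketch (independent of the model; any array with one embedding per row works):

```python
import numpy as np

def cosine_matrix(embeddings):
    # L2-normalize each row, then one matmul yields all pairwise cosines.
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return emb @ emb.T
```

`cosine_matrix(model.encode(sentences))[i, j]` would then give the similarity between sentences `i` and `j`.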
## Fine-tuning

### Data format
```json
[
    {
        "sentence1": "一个男人在吹一支大笛子。",
        "sentence2": "一个人在吹长笛。",
        "label": 3
    },
    {
        "sentence1": "三个人在下棋。",
        "sentence2": "两个人在下棋。",
        "label": 2
    },
    {
        "sentence1": "一个女人在写作。",
        "sentence2": "一个女人在游泳。",
        "label": 0
    }
]
```
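A quick way to sanity-check a training file against this schema before launching a run (a hypothetical helper, not part of the repository; `load_pairs` and the inline sample are illustrative only):

```python
import json

# The training file is a JSON list of records, each with two sentences
# and an integer similarity grade (higher = more similar).
sample = '''
[
  {"sentence1": "a", "sentence2": "b", "label": 3},
  {"sentence1": "c", "sentence2": "d", "label": 0}
]
'''

def load_pairs(text):
    records = json.loads(text)
    for r in records:
        # Every record must carry exactly these fields with an int label.
        assert {"sentence1", "sentence2", "label"} <= r.keys()
        assert isinstance(r["label"], int)
    return records
```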
### CoSENT fine-tuning

Change to the `train/` directory:
```bash
cd train/
```
Run the CoSENT fine-tuning script:
```bash
python cosent_finetune.py \
    --data_dir ../data/train_data.json \
    --output_dir ./outputs/my-model \
    --max_seq_length 1024 \
    --num_epochs 10 \
    --batch_size 64 \
    --learning_rate 2e-5
```
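For intuition, the CoSENT objective is a ranking loss over the labels above: any pair graded more similar should receive a higher cosine score, and violations are penalized with a log-sum-exp. A minimal numpy sketch under that reading of CoSENT (an illustrative reconstruction, not the code in `cosent_finetune.py`; the scale of 20 is a commonly used default, not taken from this repository):

```python
import numpy as np

def cosent_loss(cos_scores, labels, scale=20.0):
    # For every ordered pair (i, j) with labels[i] > labels[j], the model
    # should score pair i above pair j; collect the scaled violations.
    cos = scale * np.asarray(cos_scores, dtype=float)
    labels = np.asarray(labels)
    diffs = [cos[j] - cos[i]
             for i in range(len(cos))
             for j in range(len(cos))
             if labels[i] > labels[j]]
    # log(1 + sum(exp(diffs))): near zero when every ranking is respected.
    return float(np.log1p(np.sum(np.exp(diffs)))) if diffs else 0.0
```

The loss stays close to zero when cosine scores already order the pairs correctly and grows roughly linearly with the worst margin violation.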
## Contributing

Contributions to this module are welcome, either by submitting a pull request or by opening an issue in the repository.

## License

This project is released under the [Apache-2.0 license](./LICENSE).