Upload 5 files
- .gitattributes +1 -0
- README.md +38 -13
- app.py +87 -0
- requirements.txt +5 -0
- screenshot.png +3 -0
- streamlit_shap.py +126 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+screenshot.png filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,13 +1,38 @@
+# Machine-Generated Text Detector
+
+
+
+## Introduction
+
+This app uses a BERT model and SHAP explainability analysis to help users judge whether a text was likely generated by a machine. Users enter a text, a pretrained BERT model analyzes it, and SHAP then provides an interpretable breakdown of the text to help explain the model's prediction.
+
+## Features
+
+- **Text input**: choose one of the preset sample texts, or enter custom text to check.
+- **Machine-generated probability**: the app displays the probability that the text is judged to be machine-generated.
+- **Sentence-level SHAP explainability analysis**: for a given text, the app shows which parts were decisive for the model's judgment.
+
+## Installation
+
+1. Clone the repository or download the code locally.
+2. This project uses the following dependencies:
+```
+matplotlib==3.8.3
+shap==0.44.1
+streamlit==1.31.1
+torch==2.2.0
+transformers==4.38.1
+```
+3. Run `streamlit run app.py` on the command line to start the app.
+
+## Notes
+
+- The app needs some time to load the model and analyze the text, so please be patient.
+- SHAP explainability analysis requires at least 2 sentences (split on periods, question marks, and exclamation marks); texts that are too short may not be analyzable.
+
+## Acknowledgements
+
+- [shap](https://github.com/shap/shap)
+- [streamlit](https://streamlit.io/)
+- [streamlit-shap](https://github.com/snehankekre/streamlit-shap)
+- [huggingface](https://huggingface.co/)
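The README's note that SHAP analysis needs at least two sentences can be checked before running the explainer. The following is a minimal sketch, assuming sentences end at a newline or at a Chinese or ASCII period, question mark, or exclamation mark; the helper names `split_sentences` and `has_enough_sentences` are hypothetical and not part of this commit.

```python
import re

# Assumed delimiter set: newline plus Chinese and ASCII sentence-ending punctuation.
SENTENCE_DELIMITERS = re.compile(r"[\n。.？?！!]")


def split_sentences(text: str) -> list[str]:
    """Split text on the delimiters and drop empty or whitespace-only pieces."""
    return [piece.strip() for piece in SENTENCE_DELIMITERS.split(text) if piece.strip()]


def has_enough_sentences(text: str, minimum: int = 2) -> bool:
    """Return True when the text contains at least `minimum` non-empty sentences."""
    return len(split_sentences(text)) >= minimum
```

This only approximates how the app segments text: a plain regex also splits at decimal points and abbreviations, so it may overcount sentences in such inputs.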
app.py
ADDED
@@ -0,0 +1,87 @@
+import shap
+import streamlit as st
+import torch
+from transformers import BertForSequenceClassification, BertTokenizerFast
+
+from streamlit_shap import st_shap
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+tokenizer = BertTokenizerFast.from_pretrained(
+    "JeremyFeng/machine-generated-text-detection"
+)
+model = BertForSequenceClassification.from_pretrained(
+    "JeremyFeng/machine-generated-text-detection"
+).to(device)
+
+
+def pred(x):  # x: iterable of strings; returns the class-1 (machine-generated) probability per text
+    predlist = []
+    for text in x:
+        encodings = tokenizer(
+            text,
+            return_tensors="pt",
+            padding=True,
+            truncation=True,
+            max_length=512,
+            return_token_type_ids=False,
+            return_attention_mask=True,
+        ).to(device)
+        input_ids, attention_mask = encodings["input_ids"], encodings["attention_mask"]
+
+        with torch.no_grad():
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
+            logits = outputs.logits
+            y_score = torch.nn.functional.softmax(logits, dim=1)[:, 1].cpu().detach()
+            predlist.append(y_score)
+
+    predtensor = torch.cat(predlist)
+    return predtensor.numpy()
+
+
+st.title("Machine-Generated Text Detector")
+
+default_texts = [  # Chinese sample texts offered in the dropdown
+    "图像识别,是指利用计算机对图像进行处理、分析和理解,以识别各种不同模式的目标和对象的技术,是应用深度学习算法的一种实践应用。现阶段图像识别技术一般分为人脸识别与商品识别,人脸识别主要运用在安全检查、身份核验与移动支付中;商品识别主要运用在商品流通过程中,特别是无人货架、智能零售柜等无人零售领域。图像的传统识别流程分为四个步骤:图像采集→图像预处理→特征提取→图像识别。随着科技的不断进步,图像识别技术也越来越成熟,现阶段已经能够高效准确地处理各种复杂场景。特别是卷积神经网络(CNN)等深度学习模型的运用,使得图像识别的精度大大提升。而随着 5G、云计算和人工智能等新一代信息技术的快速发展,图像识别将有可能在更多领域得到广泛应用,如医疗诊断、自动驾驶、无人机等。而且,有了大数据的支持,我们可以通过更多的样本来训练模型,提高模型的性能。",
+    "我们在学习中有学习的环境,你是一个很优秀并且得到过奖学金的人,可见你对学习环境的适应力和掌控力是很强的。现在的状态需要你放下以前的心态,重新来过。从零开始,你要到哪个公司里,现在不招人,那就从侧面能多了解就多了解这个公司的状况和要求,让自己在同行业的可以进入的其他公司里磨练,时刻注视着你要去的地方,按那里的要求来要求自己的日常工作。然后再积累经验,提高自我,逐渐向你理想的公司接近。以上所述,无论你任何时候走入新的工作环境,都需要以谦虚的态度学习,以毅力和耐心去适应。但同时,也要积极向前看,进行自我提升,为未来的职业生涯铺路。通过自我磨练和不断学习,你可以获得新的技巧和知识,进一步理解你想要去的公司的工作方式和要求。在获得这些经验之后,你会发现自己的专业素质和适应能力有了显著的提升,也更接近你的职业目标了。",
+    "我今年也大一,处境和你很相似。表面是过得去就行,大学里面还是要保持精神上的独立,如果还未遇到志同道合的同学,建议多和导员还有各科老师沟通,他们都是过来人,会理解你的处境。不要忘记,大学也是锻炼人的社交技巧和团队合作能力的地方。参加一些兴趣社团也是好的选择,可以让你结交到来自不同专业,但有着相同兴趣的人。这样你会发现,原来和你一样迷茫的人其实并不少。这种经历,会让你更加坚定,更懂得如何处理人际关系,如何在艰难困苦中找到自己的方向。同时,也一定要注意自我调节,以保持良好的精神和身体健康。这是你走向成功的重要因素。总的来说,通过这样的方式来发现和解决问题,并随着时间的推移,你会发现自己在很大程度上都有所改变和成长,这是最宝贵的。",
+    "本文首先基于 Guo et al. (2023) 整理的中文人类-ChatGPT 问答对比语料集(HC3-Chinese),提取其中的人类生成文本。这些人类生成文本主要有两个来源:一是公开可用的问答数据集,这些数据集中的答案由特定领域的专家给出,或是网络用户投票选出的高质量答案;二是从维基百科和百度百科等资料中构造的“概念 - 解释”问答语句对。",
+]
+
+selected_text = st.selectbox(
+    "Select a sample text or enter the text to check",
+    options=["Please select..."] + default_texts,
+)
+
+if selected_text != "Please select...":
+    text_area_value = selected_text
+else:
+    text_area_value = ""
+
+user_input = st.text_area(
+    "Text to check",
+    value=text_area_value,
+    height=300,
+)
+
+if user_input == "":
+    st.stop()
+
+y_score = pred([user_input])
+if y_score[0] < 0.5:
+    st.success(f"Probability that this text is machine-generated: {y_score[0]*100:.2f}%", icon="🧑🏻‍💻")
+else:
+    st.error(f"Probability that this text is machine-generated: {y_score[0]*100:.2f}%", icon="🤖")
+
+st.subheader("Sentence-level SHAP explainability analysis")
+try:
+    masker = shap.maskers.Text(tokenizer=r"[\n。.？?！!]")  # split on newlines and sentence-ending punctuation
+    explainer = shap.Explainer(pred, masker)
+    shap_values = explainer([user_input], fixed_context=1)
+
+    st_shap(shap.plots.text(shap_values, grouping_threshold=0.8), height=300)
+except Exception as e:
+    if "zero-size array to reduction operation maximum which has no identity" in str(e):
+        st.error("❗️The text is too short for sentence-level SHAP explainability analysis")
+        st.stop()
+    st.exception(e)
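The score shown by `app.py` is the softmax probability of class index 1, compared against a 0.5 threshold. The sketch below reproduces that arithmetic with the standard library only, so it runs without `torch`; the function names are hypothetical, and treating index 1 as the machine-generated class is an assumption taken from how the app labels `y_score`.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def machine_probability(logits: list[float]) -> float:
    """Probability of class index 1, assumed here to mean machine-generated."""
    return softmax(logits)[1]


def verdict(probability: float, threshold: float = 0.5) -> str:
    """Mirror the app's branch: below the threshold is human, otherwise machine."""
    return "human" if probability < threshold else "machine"
```

For example, logits of `[2.0, -1.0]` give a class-1 probability well below 0.5, so `verdict` returns `"human"`; a probability of exactly 0.5 falls on the machine side, matching the `< 0.5` comparison in the app.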
requirements.txt
ADDED
@@ -0,0 +1,5 @@
+matplotlib==3.8.3
+shap==0.44.1
+streamlit==1.31.1
+torch==2.2.0
+transformers==4.38.1
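One reason exact pins such as `shap==0.44.1` matter here: `streamlit_shap.py` replaces `shap.plots.text.__defaults__` with a positional tuple, and in Python that tuple maps onto the function's trailing parameters in order. If a different release changed the parameter list, the same tuple could silently apply to the wrong parameters. The sketch below shows the mechanism on a hypothetical stand-in function `render`, not on SHAP itself.

```python
def render(values, threshold=0.01, separator="", display=True):
    """Hypothetical stand-in for a plotting function with trailing defaults."""
    return {"threshold": threshold, "separator": separator, "display": display}


# __defaults__ holds the default values of the trailing parameters, in order.
original_defaults = render.__defaults__

# Replacing the whole tuple flips only the last default (display) to False,
# because the other two entries repeat the existing values.
render.__defaults__ = (0.01, "", False)

result_default = render([1, 2, 3])
result_explicit = render([1, 2, 3], display=True)
```

Callers can still pass `display=True` explicitly; only the fallback value changes. A wrapper built with `functools.partial(render, display=False)` would be a less fragile alternative, since it names the parameter instead of relying on its position.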
screenshot.png
ADDED
streamlit_shap.py
ADDED
@@ -0,0 +1,126 @@
+import base64
+
+# Shap plots internally call plt.show()
+# On Linux, prevent plt.show() from emitting a non-GUI backend warning.
+import os
+from io import BytesIO
+
+import matplotlib.pyplot as plt
+import shap
+import streamlit.components.v1 as components
+from matplotlib.figure import Figure
+
+os.environ.pop("DISPLAY", None)
+# Text plots return an IPython.core.display.HTML object
+# Set display=False to return an HTML string instead
+shap.plots.text.__defaults__ = (0, 0.01, "", None, None, None, False)
+# Prevent clipping of the ticks and axis labels
+plt.rcParams["figure.autolayout"] = True
+
+# Note: Colorbar changes (introduced bugs) in matplotlib>3.4.3
+# cause the colorbar of certain shap plots (e.g. beeswarm) to not display properly
+# See: https://github.com/matplotlib/matplotlib/issues/22625 and
+# https://github.com/matplotlib/matplotlib/issues/22087
+# If colorbars are not displayed properly, try downgrading matplotlib to 3.4.3
+
+
+def st_shap(plot, height=None, width=None):
+    """Takes a SHAP plot as input, and returns a streamlit.delta_generator.DeltaGenerator as output.
+
+    It is recommended to set the height and width
+    parameters so that the plot fits the window.
+
+    Parameters
+    ----------
+    plot : None or matplotlib.figure.Figure or SHAP plot object
+        The SHAP plot object.
+    height: int or None
+        The height of the plot in pixels.
+    width: int or None
+        The width of the plot in pixels.
+
+    Returns
+    -------
+    streamlit.delta_generator.DeltaGenerator
+        A SHAP plot as a streamlit.delta_generator.DeltaGenerator object.
+    """
+
+    # Plots such as waterfall and bar have no return value
+    # They create a new figure and call plt.show()
+    if plot is None:
+        # Test whether there is currently a Figure on the pyplot figure stack
+        # A Figure exists if the shap plot called plt.show()
+        if plt.get_fignums():
+            fig = plt.gcf()
+            ax = plt.gca()
+
+            # Save it to a temporary buffer
+            buf = BytesIO()
+
+            if height is None:
+                _, height = fig.get_size_inches() * fig.dpi
+
+            if width is None:
+                width, _ = fig.get_size_inches() * fig.dpi
+
+            fig.set_size_inches(width / fig.dpi, height / fig.dpi, forward=True)
+            fig.savefig(buf, format="png")
+
+            # Embed the result in the HTML output
+            data = base64.b64encode(buf.getbuffer()).decode("ascii")
+            html_str = f"<img src='data:image/png;base64,{data}'/>"
+
+            # Enable pyplot to properly clean up the memory
+            plt.cla()
+            plt.close(fig)
+
+            fig = components.html(html_str, height=height, width=width)
+        else:
+            fig = components.html(
+                "<p>[Error] No plot to display. Received object of type &lt;class 'NoneType'&gt;.</p>"
+            )
+
+    # SHAP plots return a matplotlib.figure.Figure object when passed show=False as an argument
+    elif isinstance(plot, Figure):
+        fig = plot
+
+        # Save it to a temporary buffer
+        buf = BytesIO()
+
+        if height is None:
+            _, height = fig.get_size_inches() * fig.dpi
+
+        if width is None:
+            width, _ = fig.get_size_inches() * fig.dpi
+
+        fig.set_size_inches(width / fig.dpi, height / fig.dpi, forward=True)
+        fig.savefig(buf, format="png")
+
+        # Embed the result in the HTML output
+        data = base64.b64encode(buf.getbuffer()).decode("ascii")
+        html_str = f"<img src='data:image/png;base64,{data}'/>"
+
+        # Enable pyplot to properly clean up the memory
+        plt.cla()
+        plt.close(fig)
+
+        fig = components.html(html_str, height=height, width=width)
+
+    # SHAP plots containing JS/HTML have one or more of the following callable attributes
+    elif hasattr(plot, "html") or hasattr(plot, "data") or hasattr(plot, "matplotlib"):
+        shap_js = f"{shap.getjs()}".replace("height=350", f"height={height}").replace(
+            "width=100", f"width={width}"
+        )
+        shap_html = f"<head>{shap_js}</head><body>{plot.html()}</body>"
+        fig = components.html(shap_html, height=height, width=width)
+
+    # shap.plots.text plots have been overridden to return a string
+    elif isinstance(plot, str):
+        fig = components.html(plot, height=height, width=width, scrolling=True)
+
+    else:
+        fig = components.html(
+            "<p>[Error] No plot to display. Unable to understand input.</p>"
+        )
+
+    return fig
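Both matplotlib branches of `st_shap` use the same embedding step: PNG bytes are base64-encoded and placed in an `<img>` tag as a data URI. The sketch below isolates that step with the standard library only; the helper name `png_bytes_to_img_tag` is hypothetical, and the 8-byte PNG signature stands in for a real rendered figure.

```python
import base64


def png_bytes_to_img_tag(png_bytes: bytes) -> str:
    """Wrap raw PNG bytes in an <img> tag that carries the image as a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"<img src='data:image/png;base64,{encoded}'/>"


# The 8-byte PNG file signature is used here in place of a real image.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
tag = png_bytes_to_img_tag(PNG_SIGNATURE)

# Recover the payload to confirm that the encoding round-trips.
payload = tag.split("base64,", 1)[1].split("'", 1)[0]
decoded = base64.b64decode(payload)
```

Factoring the step out this way would also let the two duplicated branches in `st_shap` share one helper, though that refactor is not part of this commit.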