{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 🎓 中文情感分析系统:交互式教学教程\n",
"\n",
"## 👋 欢迎!\n",
"欢迎来到这份专为学习者设计的 **交互式 Jupyter Notebook** 教程。\n",
"\n",
"**本项目的目标**:我们将从零开始,构建一个能够理解中文评论“情绪”的人工智能模型。不是简单地调用 API,而是亲手训练一个工业级的 **BERT** 模型。\n",
"\n",
"## 📚 你将学到什么?\n",
"1. **环境配置**:如何利用 Mac 的 MPS 加速深度学习。\n",
"2. **数据工程**:从 Hugging Face 获取数据,并清洗、统一。\n",
"3. **模型原理**:BERT 是如何理解中文的?\n",
"4. **模型训练**:如何进行微调 (Fine-tuning) 以适应特定任务。\n",
"5. **模型应用**:如何用自己训练的模型来分析一句话。\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1️⃣ 第一步:导入工具包与环境检查\n",
"\n",
"在开始做菜之前,我们需要先把锅碗瓢盆(工具包)准备好。\n",
"\n",
"**核心工具介绍**:\n",
"* **Transformers**: 由 Hugging Face 提供,是目前全世界最流行的 NLP 库,用来加载 BERT 模型。\n",
"* **Datasets**:这也是 Hugging Face 的产品,用来下载与处理海量数据。\n",
"* **Pandas**: 用来像 Excel 一样查看数据表格。\n",
"* **Torch**: Pytorch 深度学习框架,我们的“引擎”。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import torch\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from datasets import load_dataset, concatenate_datasets\n",
"from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer\n",
"from sklearn.metrics import accuracy_score, precision_recall_fscore_support\n",
"\n",
"# === 硬件加速检查 ===\n",
"# 深度学习需要大量的矩阵计算,CPU 算得太慢。\n",
"# Mac 电脑有专门的 MPS (Metal Performance Shaders) 加速芯片。\n",
"if torch.backends.mps.is_available():\n",
" device = torch.device(\"mps\")\n",
" print(\"✅ 恭喜!检测到 Mac MPS 硬件加速,训练速度将起飞!🚀\")\n",
"elif torch.cuda.is_available():\n",
" device = torch.device(\"cuda\")\n",
" print(\"✅ 检测到 NVIDIA CUDA,将使用 GPU 训练。\")\n",
"else:\n",
" device = torch.device(\"cpu\")\n",
" print(\"⚠️ 未检测到 GPU,将使用 CPU 训练。速度可能会比较慢,请耐心等待。☕️\")"
]
},
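{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an illustration added here, not part of the original pipeline), the cell below runs a small matrix multiplication on the selected `device` to confirm the backend actually works before we commit to a long training run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: make sure tensors can be created and multiplied on `device`.\n",
"x = torch.randn(256, 256, device=device)\n",
"y = x @ x  # A small matmul exercises the same kernels training will use\n",
"print(f\"Matmul ran successfully on: {y.device}\")"
]
},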
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2️⃣ 第二步:配置参数 (Config)\n",
"\n",
"为了让代码整洁,我们将所有的“设置项”都放在这里。这就好比做菜前的“菜谱”。\n",
"\n",
"* **BASE_MODEL**: 我们选用的基底模型是 `bert-base-chinese`,它是谷歌训练好的、已经读过几亿字中文的“高材生”。\n",
"* **NUM_EPOCHS**: 训练轮数。设为 3,意味着模型会把我们的教材从头到尾看 3 遍。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class Config:\n",
" # 基模型:BERT 中文版\n",
" BASE_MODEL = \"google-bert/bert-base-chinese\"\n",
" \n",
" # 分类数量:3类 (消极-0, 中性-1, 积极-2)\n",
" NUM_LABELS = 3\n",
" \n",
" # 每一句话最长处理多少个字?超过的截断,不足的补0\n",
" MAX_LENGTH = 128\n",
" \n",
" # 路径配置\n",
" OUTPUT_DIR = \"../checkpoints/tutorial_model\"\n",
" \n",
" # 训练超参数\n",
" BATCH_SIZE = 16 # 一次可以并行处理多少句话 (看显存大小)\n",
" LEARNING_RATE = 2e-5 # 学习率:模型学得太快容易学偏,太慢容易学不会。2e-5 是经验值。\n",
" NUM_EPOCHS = 3 # 训练几轮\n",
" \n",
" # 标签字典\n",
" ID2LABEL = {0: 'Negative (消极)', 1: 'Neutral (中性)', 2: 'Positive (积极)'}\n",
" LABEL2ID = {'negative': 0, 'neutral': 1, 'positive': 2}\n",
"\n",
"print(\"配置加载完毕。\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3️⃣ 第三步:准备数据 (Data Preparation)\n",
"\n",
"我们的策略是 **“混合双打”**:\n",
"1. **通用数据** (`clapAI`): 包含日常生活的各种评论,让模型懂常识。\n",
"2. **垂直数据** (`OpenModels`): 包含中医药领域的评论,让模型懂行话。\n",
"\n",
"下面的代码会自动从网络加载这些数据,并进行清洗。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 加载 Tokenizer (分词器)\n",
"# 它的作用是把汉字转换成模型能读懂的数字 ID\n",
"tokenizer = AutoTokenizer.from_pretrained(Config.BASE_MODEL)\n",
"\n",
"def prepare_dataset():\n",
" print(\"⏳ 正在加载数据 (可能需要一点时间下载)...\")\n",
" \n",
" # 为了演示速度,我们只取前 1000 条数据 (正式训练时会用全部数据)\n",
" # 如果电脑性能好,可以把 split=\"train[:1000]\" 改成 split=\"train\"\n",
" sample_size = 500\n",
" \n",
" # 1. 加载通用情感数据\n",
" ds_clap = load_dataset(\"clapAI/MultiLingualSentiment\", split=f\"train[:{sample_size}]\", trust_remote_code=True)\n",
" ds_clap = ds_clap.filter(lambda x: x['language'] == 'zh') # 只留中文\n",
" \n",
" # 2. 加载中医药情感数据\n",
" ds_med = load_dataset(\"OpenModels/Chinese-Herbal-Medicine-Sentiment\", split=f\"train[:{sample_size}]\", trust_remote_code=True)\n",
" \n",
" # 3. 统一列名\n",
" # 不同数据集的列名可能不一样,我们要把它们统一改成 'text' 和 'label'\n",
" if 'review_text' in ds_med.column_names: ds_med = ds_med.rename_column('review_text', 'text')\n",
" if 'sentiment_label' in ds_med.column_names: ds_med = ds_med.rename_column('sentiment_label', 'label')\n",
" \n",
" # 4. 合并数据集\n",
" common_cols = ['text', 'label']\n",
" combined = concatenate_datasets([ds_clap.select_columns(common_cols), ds_med.select_columns(common_cols)])\n",
" \n",
" # 5. 数据清洗与统一标签\n",
" def process_data(example):\n",
" # 统一标签为数字 0, 1, 2\n",
" lbl = example['label']\n",
" if isinstance(lbl, str):\n",
" lbl = lbl.lower()\n",
" if lbl in ['negative', '0']: lbl = 0\n",
" elif lbl in ['neutral', '1']: lbl = 1\n",
" elif lbl in ['positive', '2']: lbl = 2\n",
" return {'labels': int(lbl)}\n",
" \n",
" combined = combined.map(process_data)\n",
" \n",
" # 6. 分词 (Tokenization)\n",
" def tokenize(batch):\n",
" return tokenizer(batch['text'], padding=\"max_length\", truncation=True, max_length=Config.MAX_LENGTH)\n",
" \n",
" print(\"✂️ 正在进行分词处理...\")\n",
" tokenized_ds = combined.map(tokenize, batched=True)\n",
" \n",
" # 7. 划分训练集和验证集 (90% 训练, 10% 验证)\n",
" return tokenized_ds.train_test_split(test_size=0.1)\n",
"\n",
"# 执行数据准备\n",
"dataset = prepare_dataset()\n",
"print(f\"\\n✅ 数据准备完成!\\n训练集大小: {len(dataset['train'])} 条\\n测试集大小: {len(dataset['test'])} 条\")"
]
},
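{
"cell_type": "markdown",
"metadata": {},
"source": [
"To demystify what the tokenizer did above, here is a small illustrative check (the example sentence is made up): it shows how a Chinese sentence becomes token IDs, with the special `[CLS]` and `[SEP]` markers BERT adds around every input."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: inspect the tokenizer's output for one sample sentence.\n",
"sample = \"这个产品的质量非常好!\"  # \"The quality of this product is excellent!\"\n",
"encoded = tokenizer(sample, truncation=True, max_length=Config.MAX_LENGTH)\n",
"print(\"Token IDs:\", encoded['input_ids'])\n",
"# BERT's Chinese tokenizer splits mostly character by character\n",
"print(\"Tokens   :\", tokenizer.convert_ids_to_tokens(encoded['input_ids']))"
]
},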
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4️⃣ 第四步:数据可视化 (Data Visualization)\n",
"\n",
"很多时候模型训练不好是因为数据分布不均匀(比如全是好评,那模型只要一直猜好评准确率也很高,但这没用)。\n",
"让我们画个饼图来看看我们的数据怎么样。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 从 dataset 中提取 label 列\n",
"train_labels = dataset['train']['labels']\n",
"\n",
"# 统计每个类别的数量\n",
"labels_count = pd.Series(train_labels).value_counts().sort_index()\n",
"labels_name = [Config.ID2LABEL[i] for i in labels_count.index]\n",
"\n",
"# 由于 Matplotlib 默认不支持中文,我们用英文显示或者设置字体,这里为了简单直接用英文\n",
"plt.figure(figsize=(8, 5))\n",
"plt.pie(labels_count, labels=labels_name, autopct='%1.1f%%', colors=['#ff9999','#66b3ff','#99ff99'])\n",
"plt.title('Training Data Distribution')\n",
"plt.show()"
]
},
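{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the imbalance pitfall concrete, the sketch below (an addition for illustration; it reuses `labels_count` from the previous cell) computes the accuracy a trivial classifier would get by always guessing the most common class. Any model we train should clearly beat this baseline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Majority-class baseline: always predict the most frequent label.\n",
"majority_id = labels_count.idxmax()\n",
"baseline_acc = labels_count.max() / labels_count.sum()\n",
"print(f\"Most common class: {Config.ID2LABEL[majority_id]}\")\n",
"print(f\"Always-guess-majority accuracy: {baseline_acc:.1%}\")"
]
},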
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5️⃣ 第五步:模型训练 (Model Training)\n",
"\n",
"这是最激动人心的一步!我们将启动 Hugging Face `Trainer`。\n",
"\n",
"我们将实现一个**“智能跳过”**逻辑:如果检测到之前已经训练好了模型,就直接加载,不再浪费时间重新训练。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 定义评价指标:我们需要知道模型的准确率(Accuracy)\n",
"def compute_metrics(pred):\n",
" labels = pred.label_ids\n",
" preds = pred.predictions.argmax(-1)\n",
" acc = accuracy_score(labels, preds)\n",
" return {'accuracy': acc}\n",
"\n",
"# 检查是否已存在\n",
"if os.path.exists(Config.OUTPUT_DIR) and os.path.exists(os.path.join(Config.OUTPUT_DIR, \"config.json\")):\n",
" print(f\"🎉 检测到已训练的模型: {Config.OUTPUT_DIR}\")\n",
" print(\"🚀 直接加载模型,跳过训练!\")\n",
" model = AutoModelForSequenceClassification.from_pretrained(Config.OUTPUT_DIR)\n",
" model.to(device)\n",
"else:\n",
" print(\"💪 未找到已训练模型,开始新一轮训练...\")\n",
" \n",
" # 加载初始模型\n",
" model = AutoModelForSequenceClassification.from_pretrained(Config.BASE_MODEL, num_labels=Config.NUM_LABELS)\n",
" model.to(device)\n",
" \n",
" # 设置训练参数\n",
" training_args = TrainingArguments(\n",
" output_dir=Config.OUTPUT_DIR,\n",
" num_train_epochs=Config.NUM_EPOCHS,\n",
" per_device_train_batch_size=Config.BATCH_SIZE,\n",
" evaluation_strategy=\"epoch\", # 每个 Epoch 结束后评估一次\n",
" save_strategy=\"epoch\", # 每个 Epoch 结束后保存一次\n",
" logging_steps=10,\n",
" report_to=\"none\" # 不上报到wandb\n",
" )\n",
" \n",
" # 初始化训练器\n",
" trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" train_dataset=dataset['train'],\n",
" eval_dataset=dataset['test'],\n",
" processing_class=tokenizer,\n",
" compute_metrics=compute_metrics\n",
" )\n",
" \n",
" # 开始训练!\n",
" trainer.train()\n",
" \n",
" # 保存最终结果\n",
" trainer.save_model(Config.OUTPUT_DIR)\n",
" tokenizer.save_pretrained(Config.OUTPUT_DIR)\n",
" print(\"💾 训练完成,模型已保存!\")"
]
},
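{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accuracy alone can hide weak classes, so the cell below also reports macro-averaged precision, recall and F1 on the held-out split using `precision_recall_fscore_support` (imported in Step 1). This is a minimal sketch that runs the model manually in batches; it works whether the model was just trained or loaded from disk."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal evaluation sketch on the test split (batched manual inference).\n",
"model.eval()\n",
"all_preds, all_labels = [], []\n",
"test_ds = dataset['test']\n",
"for i in range(0, len(test_ds), Config.BATCH_SIZE):\n",
"    batch = test_ds[i:i + Config.BATCH_SIZE]  # slicing a Dataset yields a dict of lists\n",
"    enc = tokenizer(batch['text'], padding=True, truncation=True,\n",
"                    max_length=Config.MAX_LENGTH, return_tensors='pt').to(device)\n",
"    with torch.no_grad():\n",
"        logits = model(**enc).logits\n",
"    all_preds.extend(logits.argmax(-1).cpu().tolist())\n",
"    all_labels.extend(batch['labels'])\n",
"\n",
"prec, rec, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='macro', zero_division=0)\n",
"print(f\"Accuracy    : {accuracy_score(all_labels, all_preds):.3f}\")\n",
"print(f\"Macro P/R/F1: {prec:.3f} / {rec:.3f} / {f1:.3f}\")"
]
},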
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6️⃣ 第六步:互动测试 (Inference Demo)\n",
"\n",
"现在模型已经“毕业”了,让我们来考考它!\n",
"在下面的输入框里随便输入一句话(支持中文),点击“分析”看看它觉得的情感是什么。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ipywidgets as widgets\n",
"from IPython.display import display\n",
"\n",
"# 预测函数\n",
"def predict_sentiment(text):\n",
" # 1. 预处理\n",
" inputs = tokenizer(text, return_tensors=\"pt\", truncation=True, max_length=128, padding=True)\n",
" inputs = {k: v.to(device) for k, v in inputs.items()}\n",
" \n",
" # 2. 模型推理\n",
" with torch.no_grad():\n",
" outputs = model(**inputs)\n",
" probs = torch.nn.functional.softmax(outputs.logits, dim=-1)\n",
" \n",
" # 3. 结果解析\n",
" pred_idx = torch.argmax(probs).item()\n",
" confidence = probs[0][pred_idx].item()\n",
" label = Config.ID2LABEL[pred_idx]\n",
" \n",
" return label, confidence\n",
"\n",
"# 界面组件\n",
"text_box = widgets.Text(placeholder='请输入要分析的句子...', description='评论:', layout=widgets.Layout(width='400px'))\n",
"btn_run = widgets.Button(description=\"开始分析\", button_style='primary')\n",
"output_area = widgets.Output()\n",
"\n",
"def on_click(b):\n",
" with output_area:\n",
" output_area.clear_output()\n",
" text = text_box.value\n",
" if not text:\n",
" print(\"❌ 请先输入内容!\")\n",
" return\n",
" \n",
" print(f\"🔍 正在分析: \\\"{text}\\\"\")\n",
" label, conf = predict_sentiment(text)\n",
" \n",
" # 只有置信度高才显示绿色,否则显示黄色\n",
" icon = \"✅\" if conf > 0.8 else \"🤔\"\n",
" print(f\"{icon} 预测结果: [{label}] \")\n",
" print(f\"📊 置信度: {conf*100:.2f}%\")\n",
"\n",
"btn_run.on_click(on_click)\n",
"display(text_box, btn_run, output_area)"
]
}
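,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also call `predict_sentiment` directly from code, without the widget. The sample sentences below are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Programmatic usage of the same prediction function.\n",
"for sentence in [\"这家店的服务态度特别好,下次还会再来!\", \"效果一般,没什么感觉。\", \"太失望了,完全是浪费钱。\"]:\n",
"    label, conf = predict_sentiment(sentence)\n",
"    print(f\"{sentence} -> {label} ({conf*100:.1f}%)\")"
]
}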
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
} |