{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🎓 中文情感分析系统:交互式教学教程\n", "\n", "## 👋 欢迎!\n", "欢迎来到这份专为学习者设计的 **交互式 Jupyter Notebook** 教程。\n", "\n", "**本项目的目标**:我们将从零开始,构建一个能够理解中文评论“情绪”的人工智能模型。不是简单地调用 API,而是亲手训练一个工业级的 **BERT** 模型。\n", "\n", "## 📚 你将学到什么?\n", "1. **环境配置**:如何利用 Mac 的 MPS 加速深度学习。\n", "2. **数据工程**:从 Hugging Face 获取数据,并清洗、统一。\n", "3. **模型原理**:BERT 是如何理解中文的?\n", "4. **模型训练**:如何进行微调 (Fine-tuning) 以适应特定任务。\n", "5. **模型应用**:如何用自己训练的模型来分析一句话。\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1️⃣ 第一步:导入工具包与环境检查\n", "\n", "在开始做菜之前,我们需要先把锅碗瓢盆(工具包)准备好。\n", "\n", "**核心工具介绍**:\n", "* **Transformers**: 由 Hugging Face 提供,是目前全世界最流行的 NLP 库,用来加载 BERT 模型。\n", "* **Datasets**:这也是 Hugging Face 的产品,用来下载与处理海量数据。\n", "* **Pandas**: 用来像 Excel 一样查看数据表格。\n", "* **Torch**: Pytorch 深度学习框架,我们的“引擎”。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import torch\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from datasets import load_dataset, concatenate_datasets\n", "from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer\n", "from sklearn.metrics import accuracy_score, precision_recall_fscore_support\n", "\n", "# === 硬件加速检查 ===\n", "# 深度学习需要大量的矩阵计算,CPU 算得太慢。\n", "# Mac 电脑有专门的 MPS (Metal Performance Shaders) 加速芯片。\n", "if torch.backends.mps.is_available():\n", " device = torch.device(\"mps\")\n", " print(\"✅ 恭喜!检测到 Mac MPS 硬件加速,训练速度将起飞!🚀\")\n", "elif torch.cuda.is_available():\n", " device = torch.device(\"cuda\")\n", " print(\"✅ 检测到 NVIDIA CUDA,将使用 GPU 训练。\")\n", "else:\n", " device = torch.device(\"cpu\")\n", " print(\"⚠️ 未检测到 GPU,将使用 CPU 训练。速度可能会比较慢,请耐心等待。☕️\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2️⃣ 第二步:配置参数 (Config)\n", "\n", "为了让代码整洁,我们将所有的“设置项”都放在这里。这就好比做菜前的“菜谱”。\n", "\n", "* **BASE_MODEL**: 我们选用的基底模型是 
`bert-base-chinese`,它是谷歌训练好的、已经读过几亿字中文的“高材生”。\n", "* **NUM_EPOCHS**: 训练轮数。设为 3,意味着模型会把我们的教材从头到尾看 3 遍。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Config:\n", " # 基模型:BERT 中文版\n", " BASE_MODEL = \"google-bert/bert-base-chinese\"\n", " \n", " # 分类数量:3类 (消极-0, 中性-1, 积极-2)\n", " NUM_LABELS = 3\n", " \n", " # 每一句话最长处理多少个字?超过的截断,不足的补0\n", " MAX_LENGTH = 128\n", " \n", " # 路径配置\n", " OUTPUT_DIR = \"../checkpoints/tutorial_model\"\n", " \n", " # 训练超参数\n", " BATCH_SIZE = 16 # 一次可以并行处理多少句话 (看显存大小)\n", " LEARNING_RATE = 2e-5 # 学习率:模型学得太快容易学偏,太慢容易学不会。2e-5 是经验值。\n", " NUM_EPOCHS = 3 # 训练几轮\n", " \n", " # 标签字典\n", " ID2LABEL = {0: 'Negative (消极)', 1: 'Neutral (中性)', 2: 'Positive (积极)'}\n", " LABEL2ID = {'negative': 0, 'neutral': 1, 'positive': 2}\n", "\n", "print(\"配置加载完毕。\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3️⃣ 第三步:准备数据 (Data Preparation)\n", "\n", "我们的策略是 **“混合双打”**:\n", "1. **通用数据** (`clapAI`): 包含日常生活的各种评论,让模型懂常识。\n", "2. **垂直数据** (`OpenModels`): 包含中医药领域的评论,让模型懂行话。\n", "\n", "下面的代码会自动从网络加载这些数据,并进行清洗。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 加载 Tokenizer (分词器)\n", "# 它的作用是把汉字转换成模型能读懂的数字 ID\n", "tokenizer = AutoTokenizer.from_pretrained(Config.BASE_MODEL)\n", "\n", "def prepare_dataset():\n", " print(\"⏳ 正在加载数据 (可能需要一点时间下载)...\")\n", " \n", " # 为了演示速度,每个数据集只取前 500 条数据 (正式训练时会用全部数据)\n", " # 如果电脑性能好,可以调大 sample_size,或直接把 split 改成 \"train\"\n", " sample_size = 500\n", " \n", " # 1. 加载通用情感数据\n", " ds_clap = load_dataset(\"clapAI/MultiLingualSentiment\", split=f\"train[:{sample_size}]\", trust_remote_code=True)\n", " ds_clap = ds_clap.filter(lambda x: x['language'] == 'zh') # 只留中文 (注意:先取样再过滤,中文样本会少于 500 条)\n", " \n", " # 2. 加载中医药情感数据\n", " ds_med = load_dataset(\"OpenModels/Chinese-Herbal-Medicine-Sentiment\", split=f\"train[:{sample_size}]\", trust_remote_code=True)\n", " \n", " # 3. 
统一列名\n", " # 不同数据集的列名可能不一样,我们要把它们统一改成 'text' 和 'label'\n", " if 'review_text' in ds_med.column_names: ds_med = ds_med.rename_column('review_text', 'text')\n", " if 'sentiment_label' in ds_med.column_names: ds_med = ds_med.rename_column('sentiment_label', 'label')\n", " \n", " # 4. 合并数据集\n", " common_cols = ['text', 'label']\n", " combined = concatenate_datasets([ds_clap.select_columns(common_cols), ds_med.select_columns(common_cols)])\n", " \n", " # 5. 数据清洗与统一标签\n", " def process_data(example):\n", " # 统一标签为数字 0, 1, 2\n", " lbl = example['label']\n", " if isinstance(lbl, str):\n", " lbl = lbl.lower()\n", " if lbl in ['negative', '0']: lbl = 0\n", " elif lbl in ['neutral', '1']: lbl = 1\n", " elif lbl in ['positive', '2']: lbl = 2\n", " return {'labels': int(lbl)}\n", " \n", " combined = combined.map(process_data)\n", " \n", " # 6. 分词 (Tokenization)\n", " def tokenize(batch):\n", " return tokenizer(batch['text'], padding=\"max_length\", truncation=True, max_length=Config.MAX_LENGTH)\n", " \n", " print(\"✂️ 正在进行分词处理...\")\n", " tokenized_ds = combined.map(tokenize, batched=True)\n", " \n", " # 7. 
划分训练集和验证集 (90% 训练, 10% 验证)\n", " return tokenized_ds.train_test_split(test_size=0.1)\n", "\n", "# 执行数据准备\n", "dataset = prepare_dataset()\n", "print(f\"\\n✅ 数据准备完成!\\n训练集大小: {len(dataset['train'])} 条\\n测试集大小: {len(dataset['test'])} 条\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4️⃣ 第四步:数据可视化 (Data Visualization)\n", "\n", "很多时候模型训练不好是因为数据分布不均匀(比如全是好评,那模型只要一直猜好评准确率也很高,但这没用)。\n", "让我们画个饼图来看看我们的数据怎么样。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 从 dataset 中提取 label 列\n", "train_labels = dataset['train']['labels']\n", "\n", "# 统计每个类别的数量\n", "labels_count = pd.Series(train_labels).value_counts().sort_index()\n", "labels_name = [Config.ID2LABEL[i] for i in labels_count.index]\n", "\n", "# 由于 Matplotlib 默认不支持中文,我们用英文显示或者设置字体,这里为了简单直接用英文\n", "plt.figure(figsize=(8, 5))\n", "plt.pie(labels_count, labels=labels_name, autopct='%1.1f%%', colors=['#ff9999','#66b3ff','#99ff99'])\n", "plt.title('Training Data Distribution')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5️⃣ 第五步:模型训练 (Model Training)\n", "\n", "这是最激动人心的一步!我们将启动 Hugging Face `Trainer`。\n", "\n", "我们将实现一个**“智能跳过”**逻辑:如果检测到之前已经训练好了模型,就直接加载,不再浪费时间重新训练。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 定义评价指标:我们需要知道模型的准确率(Accuracy)\n", "def compute_metrics(pred):\n", " labels = pred.label_ids\n", " preds = pred.predictions.argmax(-1)\n", " acc = accuracy_score(labels, preds)\n", " return {'accuracy': acc}\n", "\n", "# 检查是否已存在\n", "if os.path.exists(Config.OUTPUT_DIR) and os.path.exists(os.path.join(Config.OUTPUT_DIR, \"config.json\")):\n", " print(f\"🎉 检测到已训练的模型: {Config.OUTPUT_DIR}\")\n", " print(\"🚀 直接加载模型,跳过训练!\")\n", " model = AutoModelForSequenceClassification.from_pretrained(Config.OUTPUT_DIR)\n", " model.to(device)\n", "else:\n", " print(\"💪 未找到已训练模型,开始新一轮训练...\")\n", " \n", " # 加载初始模型\n", " model = 
AutoModelForSequenceClassification.from_pretrained(Config.BASE_MODEL, num_labels=Config.NUM_LABELS)\n", " model.to(device)\n", " \n", " # 设置训练参数\n", " training_args = TrainingArguments(\n", " output_dir=Config.OUTPUT_DIR,\n", " num_train_epochs=Config.NUM_EPOCHS,\n", " per_device_train_batch_size=Config.BATCH_SIZE,\n", " per_device_eval_batch_size=Config.BATCH_SIZE,\n", " learning_rate=Config.LEARNING_RATE, # 把 Config 里定义的学习率真正传给训练器\n", " eval_strategy=\"epoch\", # 每个 Epoch 结束后评估一次 (旧版 transformers 中参数名为 evaluation_strategy)\n", " save_strategy=\"epoch\", # 每个 Epoch 结束后保存一次\n", " logging_steps=10,\n", " report_to=\"none\" # 不上报到wandb\n", " )\n", " \n", " # 初始化训练器\n", " trainer = Trainer(\n", " model=model,\n", " args=training_args,\n", " train_dataset=dataset['train'],\n", " eval_dataset=dataset['test'],\n", " processing_class=tokenizer,\n", " compute_metrics=compute_metrics\n", " )\n", " \n", " # 开始训练!\n", " trainer.train()\n", " \n", " # 保存最终结果\n", " trainer.save_model(Config.OUTPUT_DIR)\n", " tokenizer.save_pretrained(Config.OUTPUT_DIR)\n", " print(\"💾 训练完成,模型已保存!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6️⃣ 第六步:互动测试 (Inference Demo)\n", "\n", "现在模型已经“毕业”了,让我们来考考它!\n", "在下面的输入框里随便输入一句话(支持中文),点击“分析”看看它分析出的情感是什么。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ipywidgets as widgets\n", "from IPython.display import display\n", "\n", "# 预测函数\n", "def predict_sentiment(text):\n", " # 1. 预处理\n", " inputs = tokenizer(text, return_tensors=\"pt\", truncation=True, max_length=Config.MAX_LENGTH, padding=True)\n", " inputs = {k: v.to(device) for k, v in inputs.items()}\n", " \n", " # 2. 模型推理\n", " model.eval() # 切换到评估模式,关闭 dropout\n", " with torch.no_grad():\n", " outputs = model(**inputs)\n", " probs = torch.nn.functional.softmax(outputs.logits, dim=-1)\n", " \n", " # 3. 
结果解析\n", " pred_idx = torch.argmax(probs).item()\n", " confidence = probs[0][pred_idx].item()\n", " label = Config.ID2LABEL[pred_idx]\n", " \n", " return label, confidence\n", "\n", "# 界面组件\n", "text_box = widgets.Text(placeholder='请输入要分析的句子...', description='评论:', layout=widgets.Layout(width='400px'))\n", "btn_run = widgets.Button(description=\"开始分析\", button_style='primary')\n", "output_area = widgets.Output()\n", "\n", "def on_click(b):\n", " with output_area:\n", " output_area.clear_output()\n", " text = text_box.value\n", " if not text:\n", " print(\"❌ 请先输入内容!\")\n", " return\n", " \n", " print(f\"🔍 正在分析: \\\"{text}\\\"\")\n", " label, conf = predict_sentiment(text)\n", " \n", " # 置信度高于 0.8 时显示 ✅,否则显示 🤔 表示模型不太确定\n", " icon = \"✅\" if conf > 0.8 else \"🤔\"\n", " print(f\"{icon} 预测结果: [{label}] \")\n", " print(f\"📊 置信度: {conf*100:.2f}%\")\n", "\n", "btn_run.on_click(on_click)\n", "display(text_box, btn_run, output_area)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.0" } }, "nbformat": 4, "nbformat_minor": 2 }