{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# 🎓 Chinese Sentiment Analysis System: An Interactive Tutorial\n",
                "\n",
                "## 👋 Welcome!\n",
                "Welcome to this **interactive Jupyter Notebook** tutorial designed for learners.\n",
                "\n",
                "**Goal of this project**: starting from scratch, we will build an AI model that understands the \"sentiment\" of Chinese-language reviews. Rather than simply calling an API, we will fine-tune an industrial-grade **BERT** model ourselves.\n",
                "\n",
                "## 📚 What will you learn?\n",
                "1.  **Environment setup**: how to use the Mac's MPS backend to accelerate deep learning.\n",
                "2.  **Data engineering**: fetching datasets from Hugging Face, then cleaning and unifying them.\n",
                "3.  **Model fundamentals**: how does BERT understand Chinese?\n",
                "4.  **Model training**: how to fine-tune a pretrained model for a specific task.\n",
                "5.  **Model application**: how to use your own trained model to analyze a sentence.\n",
                "\n",
                "---"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 1️⃣ Step 1: Import Libraries and Check the Environment\n",
                "\n",
                "Before cooking, we need to lay out our pots and pans (the libraries).\n",
                "\n",
                "**Core tools**:\n",
                "*   **Transformers**: from Hugging Face, currently the most popular NLP library in the world; we use it to load the BERT model.\n",
                "*   **Datasets**: also from Hugging Face, used to download and process large datasets.\n",
                "*   **Pandas**: lets us inspect data tables, much like Excel.\n",
                "*   **Torch**: the PyTorch deep learning framework, our \"engine\"."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import os\n",
                "import torch\n",
                "import pandas as pd\n",
                "import matplotlib.pyplot as plt\n",
                "from datasets import load_dataset, concatenate_datasets\n",
                "from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer\n",
                "from sklearn.metrics import accuracy_score\n",
                "\n",
                "# === Hardware acceleration check ===\n",
                "# Deep learning involves massive matrix computations, which are slow on a CPU.\n",
                "# Apple Silicon Macs can offload this work to the GPU via MPS (Metal Performance Shaders).\n",
                "if torch.backends.mps.is_available():\n",
                "    device = torch.device(\"mps\")\n",
                "    print(\"✅ Mac MPS hardware acceleration detected; training will be fast! 🚀\")\n",
                "elif torch.cuda.is_available():\n",
                "    device = torch.device(\"cuda\")\n",
                "    print(\"✅ NVIDIA CUDA detected; training will run on the GPU.\")\n",
                "else:\n",
                "    device = torch.device(\"cpu\")\n",
                "    print(\"⚠️ No GPU detected; falling back to the CPU. Training may be slow, please be patient. ☕️\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 2️⃣ Step 2: Configuration (Config)\n",
                "\n",
                "To keep the code tidy, we gather all the \"settings\" in one place, like a recipe we consult before cooking.\n",
                "\n",
                "*   **BASE_MODEL**: our base model is `bert-base-chinese`, a \"star student\" pretrained by Google that has already read hundreds of millions of Chinese characters.\n",
                "*   **NUM_EPOCHS**: the number of training epochs. Setting it to 3 means the model reads our \"textbook\" from start to finish 3 times."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "class Config:\n",
                "    # Base model: the Chinese version of BERT\n",
                "    BASE_MODEL = \"google-bert/bert-base-chinese\"\n",
                "    \n",
                "    # Number of classes: 3 (negative = 0, neutral = 1, positive = 2)\n",
                "    NUM_LABELS = 3\n",
                "    \n",
                "    # Maximum tokens per sentence: longer inputs are truncated, shorter ones are padded with 0\n",
                "    MAX_LENGTH = 128\n",
                "    \n",
                "    # Output path\n",
                "    OUTPUT_DIR = \"../checkpoints/tutorial_model\"\n",
                "    \n",
                "    # Training hyperparameters\n",
                "    BATCH_SIZE = 16  # How many sentences to process in parallel (depends on GPU memory)\n",
                "    LEARNING_RATE = 2e-5  # Too high and the model overshoots; too low and it barely learns. 2e-5 is a common rule of thumb for BERT fine-tuning.\n",
                "    NUM_EPOCHS = 3   # Number of training epochs\n",
                "    \n",
                "    # Label dictionaries\n",
                "    ID2LABEL = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}\n",
                "    LABEL2ID = {'negative': 0, 'neutral': 1, 'positive': 2}\n",
                "\n",
                "print(\"Configuration loaded.\")"
            ]
        },
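        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To make `MAX_LENGTH` concrete, here is a minimal pure-Python sketch of fixed-length truncation and zero-padding. The real tokenizer we load in step 3 handles this for us; `pad_or_truncate` is only an illustration:\n",
                "\n",
                "```python\n",
                "def pad_or_truncate(ids, max_length):\n",
                "    # Truncate sequences that are too long...\n",
                "    if len(ids) > max_length:\n",
                "        return ids[:max_length]\n",
                "    # ...and pad short ones with 0 (BERT's [PAD] token id)\n",
                "    return ids + [0] * (max_length - len(ids))\n",
                "\n",
                "print(pad_or_truncate([101, 2769, 102], 5))        # → [101, 2769, 102, 0, 0]\n",
                "print(pad_or_truncate([101, 2769, 3221, 102], 3))  # → [101, 2769, 3221]\n",
                "```"
            ]
        },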
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 3️⃣ Step 3: Data Preparation\n",
                "\n",
                "Our strategy is a **\"mixed doubles\"** approach:\n",
                "1.  **General-purpose data** (`clapAI`): everyday reviews of all kinds, giving the model common sense.\n",
                "2.  **Domain data** (`OpenModels`): reviews from the traditional Chinese medicine domain, teaching the model the jargon.\n",
                "\n",
                "The code below downloads these datasets automatically and cleans them."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# 加载 Tokenizer (分词器)\n",
                "# 它的作用是把汉字转换成模型能读懂的数字 ID\n",
                "tokenizer = AutoTokenizer.from_pretrained(Config.BASE_MODEL)\n",
                "\n",
                "def prepare_dataset():\n",
                "    print(\"⏳ 正在加载数据 (可能需要一点时间下载)...\")\n",
                "    \n",
                "    # 为了演示速度,我们只取前 1000 条数据 (正式训练时会用全部数据)\n",
                "    # 如果电脑性能好,可以把 split=\"train[:1000]\" 改成 split=\"train\"\n",
                "    sample_size = 500\n",
                "    \n",
                "    # 1. 加载通用情感数据\n",
                "    ds_clap = load_dataset(\"clapAI/MultiLingualSentiment\", split=f\"train[:{sample_size}]\", trust_remote_code=True)\n",
                "    ds_clap = ds_clap.filter(lambda x: x['language'] == 'zh') # 只留中文\n",
                "    \n",
                "    # 2. 加载中医药情感数据\n",
                "    ds_med = load_dataset(\"OpenModels/Chinese-Herbal-Medicine-Sentiment\", split=f\"train[:{sample_size}]\", trust_remote_code=True)\n",
                "    \n",
                "    # 3. 统一列名\n",
                "    # 不同数据集的列名可能不一样,我们要把它们统一改成 'text' 和 'label'\n",
                "    if 'review_text' in ds_med.column_names: ds_med = ds_med.rename_column('review_text', 'text')\n",
                "    if 'sentiment_label' in ds_med.column_names: ds_med = ds_med.rename_column('sentiment_label', 'label')\n",
                "    \n",
                "    # 4. 合并数据集\n",
                "    common_cols = ['text', 'label']\n",
                "    combined = concatenate_datasets([ds_clap.select_columns(common_cols), ds_med.select_columns(common_cols)])\n",
                "    \n",
                "    # 5. 数据清洗与统一标签\n",
                "    def process_data(example):\n",
                "        # 统一标签为数字 0, 1, 2\n",
                "        lbl = example['label']\n",
                "        if isinstance(lbl, str):\n",
                "            lbl = lbl.lower()\n",
                "            if lbl in ['negative', '0']: lbl = 0\n",
                "            elif lbl in ['neutral', '1']: lbl = 1\n",
                "            elif lbl in ['positive', '2']: lbl = 2\n",
                "        return {'labels': int(lbl)}\n",
                "        \n",
                "    combined = combined.map(process_data)\n",
                "    \n",
                "    # 6. 分词 (Tokenization)\n",
                "    def tokenize(batch):\n",
                "        return tokenizer(batch['text'], padding=\"max_length\", truncation=True, max_length=Config.MAX_LENGTH)\n",
                "        \n",
                "    print(\"✂️ 正在进行分词处理...\")\n",
                "    tokenized_ds = combined.map(tokenize, batched=True)\n",
                "    \n",
                "    # 7. 划分训练集和验证集 (90% 训练, 10% 验证)\n",
                "    return tokenized_ds.train_test_split(test_size=0.1)\n",
                "\n",
                "# 执行数据准备\n",
                "dataset = prepare_dataset()\n",
                "print(f\"\\n✅ 数据准备完成!\\n训练集大小: {len(dataset['train'])} 条\\n测试集大小: {len(dataset['test'])} 条\")"
            ]
        },
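        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The label normalization inside `process_data` above can be hard to follow when it is buried in a `map` call. Here is the same logic as a standalone sketch (`normalize_label` is just an illustrative helper, not part of the pipeline):\n",
                "\n",
                "```python\n",
                "def normalize_label(lbl):\n",
                "    # Map string labels (or string digits) to the integers 0, 1, 2\n",
                "    if isinstance(lbl, str):\n",
                "        mapping = {'negative': 0, '0': 0, 'neutral': 1, '1': 1, 'positive': 2, '2': 2}\n",
                "        lbl = mapping[lbl.lower()]\n",
                "    return int(lbl)\n",
                "\n",
                "print(normalize_label('Positive'))  # → 2\n",
                "print(normalize_label('0'))         # → 0\n",
                "print(normalize_label(1))           # → 1\n",
                "```"
            ]
        },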
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 4️⃣ Step 4: Data Visualization\n",
                "\n",
                "Training often goes wrong because the class distribution is skewed: if nearly every review is positive, the model scores high accuracy just by always guessing \"positive\", which is useless.\n",
                "Let's draw a pie chart to see how our data is distributed."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Extract the label column from the training set\n",
                "train_labels = dataset['train']['labels']\n",
                "\n",
                "# Count the number of examples per class\n",
                "labels_count = pd.Series(train_labels).value_counts().sort_index()\n",
                "labels_name = [Config.ID2LABEL[i] for i in labels_count.index]\n",
                "\n",
                "# Matplotlib does not render Chinese by default, so we keep the chart labels in English\n",
                "plt.figure(figsize=(8, 5))\n",
                "plt.pie(labels_count, labels=labels_name, autopct='%1.1f%%', colors=['#ff9999','#66b3ff','#99ff99'])\n",
                "plt.title('Training Data Distribution')\n",
                "plt.show()"
            ]
        },
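        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "If you prefer numbers over a chart, `value_counts(normalize=True)` gives each class's share of the data directly. A self-contained sketch with a made-up label column standing in for `dataset['train']['labels']`:\n",
                "\n",
                "```python\n",
                "import pandas as pd\n",
                "\n",
                "# Hypothetical labels: 2 negative, 1 neutral, 5 positive\n",
                "labels = [0, 0, 1, 2, 2, 2, 2, 2]\n",
                "\n",
                "# normalize=True turns raw counts into fractions of the whole\n",
                "share = pd.Series(labels).value_counts(normalize=True).sort_index()\n",
                "print(share)  # 0 → 0.250, 1 → 0.125, 2 → 0.625\n",
                "```"
            ]
        },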
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 5️⃣ Step 5: Model Training\n",
                "\n",
                "This is the most exciting step! We will launch the Hugging Face `Trainer`.\n",
                "\n",
                "We also implement a **\"smart skip\"**: if a previously trained model is found on disk, we load it directly instead of wasting time retraining."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Define the evaluation metric: we want the model's accuracy\n",
                "def compute_metrics(pred):\n",
                "    labels = pred.label_ids\n",
                "    preds = pred.predictions.argmax(-1)\n",
                "    acc = accuracy_score(labels, preds)\n",
                "    return {'accuracy': acc}\n",
                "\n",
                "# Check whether a trained model already exists\n",
                "if os.path.exists(Config.OUTPUT_DIR) and os.path.exists(os.path.join(Config.OUTPUT_DIR, \"config.json\")):\n",
                "    print(f\"🎉 Found a previously trained model: {Config.OUTPUT_DIR}\")\n",
                "    print(\"🚀 Loading it directly and skipping training!\")\n",
                "    model = AutoModelForSequenceClassification.from_pretrained(Config.OUTPUT_DIR)\n",
                "    model.to(device)\n",
                "else:\n",
                "    print(\"💪 No trained model found; starting a fresh training run...\")\n",
                "    \n",
                "    # Load the pretrained base model\n",
                "    model = AutoModelForSequenceClassification.from_pretrained(Config.BASE_MODEL, num_labels=Config.NUM_LABELS)\n",
                "    model.to(device)\n",
                "    \n",
                "    # Set up the training arguments\n",
                "    training_args = TrainingArguments(\n",
                "        output_dir=Config.OUTPUT_DIR,\n",
                "        num_train_epochs=Config.NUM_EPOCHS,\n",
                "        per_device_train_batch_size=Config.BATCH_SIZE,\n",
                "        learning_rate=Config.LEARNING_RATE,\n",
                "        eval_strategy=\"epoch\",  # Evaluate at the end of every epoch\n",
                "        save_strategy=\"epoch\",  # Save a checkpoint at the end of every epoch\n",
                "        logging_steps=10,\n",
                "        report_to=\"none\"        # Do not report to wandb\n",
                "    )\n",
                "    \n",
                "    # Initialize the Trainer\n",
                "    trainer = Trainer(\n",
                "        model=model,\n",
                "        args=training_args,\n",
                "        train_dataset=dataset['train'],\n",
                "        eval_dataset=dataset['test'],\n",
                "        processing_class=tokenizer,\n",
                "        compute_metrics=compute_metrics\n",
                "    )\n",
                "    \n",
                "    # Start training!\n",
                "    trainer.train()\n",
                "    \n",
                "    # Save the final model\n",
                "    trainer.save_model(Config.OUTPUT_DIR)\n",
                "    tokenizer.save_pretrained(Config.OUTPUT_DIR)\n",
                "    print(\"💾 Training complete; model saved!\")"
            ]
        },
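        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "`compute_metrics` above delegates to sklearn's `accuracy_score`, but conceptually accuracy is just the fraction of predictions that match the true labels. A tiny sketch with made-up predictions (`accuracy` here is an illustrative helper, not part of the pipeline):\n",
                "\n",
                "```python\n",
                "def accuracy(labels, preds):\n",
                "    # Fraction of positions where the prediction equals the true label\n",
                "    correct = sum(1 for l, p in zip(labels, preds) if l == p)\n",
                "    return correct / len(labels)\n",
                "\n",
                "print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # → 0.75\n",
                "```"
            ]
        },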
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 6️⃣ Step 6: Interactive Test (Inference Demo)\n",
                "\n",
                "The model has now \"graduated\", so let's quiz it!\n",
                "Type any sentence (Chinese is supported) into the input box below and click \"Analyze\" to see which sentiment it predicts."
            ]
        },
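        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Under the hood, the demo converts the model's raw scores (logits) into probabilities with softmax. A pure-Python sketch of the same computation (the demo itself uses `torch.nn.functional.softmax`):\n",
                "\n",
                "```python\n",
                "import math\n",
                "\n",
                "def softmax(logits):\n",
                "    # Subtract the max for numerical stability, then normalize the exponentials\n",
                "    m = max(logits)\n",
                "    exps = [math.exp(x - m) for x in logits]\n",
                "    total = sum(exps)\n",
                "    return [e / total for e in exps]\n",
                "\n",
                "probs = softmax([1.0, 2.0, 0.5])\n",
                "print(probs)       # three probabilities; the largest logit gets the largest share\n",
                "print(sum(probs))  # sums to 1 (up to floating-point rounding)\n",
                "```"
            ]
        },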
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import ipywidgets as widgets\n",
                "from IPython.display import display\n",
                "\n",
                "# Prediction function\n",
                "def predict_sentiment(text):\n",
                "    # 1. Preprocess\n",
                "    inputs = tokenizer(text, return_tensors=\"pt\", truncation=True, max_length=Config.MAX_LENGTH, padding=True)\n",
                "    inputs = {k: v.to(device) for k, v in inputs.items()}\n",
                "    \n",
                "    # 2. Run inference\n",
                "    with torch.no_grad():\n",
                "        outputs = model(**inputs)\n",
                "        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)\n",
                "        \n",
                "    # 3. Parse the result\n",
                "    pred_idx = torch.argmax(probs).item()\n",
                "    confidence = probs[0][pred_idx].item()\n",
                "    label = Config.ID2LABEL[pred_idx]\n",
                "    \n",
                "    return label, confidence\n",
                "\n",
                "# UI widgets\n",
                "text_box = widgets.Text(placeholder='Enter a sentence to analyze...', description='Review:', layout=widgets.Layout(width='400px'))\n",
                "btn_run = widgets.Button(description=\"Analyze\", button_style='primary')\n",
                "output_area = widgets.Output()\n",
                "\n",
                "def on_click(b):\n",
                "    with output_area:\n",
                "        output_area.clear_output()\n",
                "        text = text_box.value\n",
                "        if not text:\n",
                "            print(\"❌ Please enter some text first!\")\n",
                "            return\n",
                "        \n",
                "        print(f\"🔍 Analyzing: \\\"{text}\\\"\")\n",
                "        label, conf = predict_sentiment(text)\n",
                "        \n",
                "        # A check mark for confident predictions, a thinking face otherwise\n",
                "        icon = \"✅\" if conf > 0.8 else \"🤔\"\n",
                "        print(f\"{icon} Prediction: [{label}]\")\n",
                "        print(f\"📊 Confidence: {conf*100:.2f}%\")\n",
                "\n",
                "btn_run.on_click(on_click)\n",
                "display(text_box, btn_run, output_area)"
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.12.0"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}