
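# Emotion and Physiological State Change Prediction Model: Complete Tutorial

## Table of Contents

1. [Project Overview](#project-overview)
2. [Installation Guide](#installation-guide)
3. [Quick Start](#quick-start)
4. [Data Preparation](#data-preparation)
5. [Model Training](#model-training)
6. [Model Inference](#model-inference)
7. [Configuration Files](#configuration-files)
8. [Command-Line Tools](#command-line-tools)
9. [FAQ](#faq)
10. [Troubleshooting](#troubleshooting)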

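## Project Overview

This project is a deep-learning model that predicts changes in a user's emotional and physiological state using a multilayer perceptron (MLP).

### Core Features
- Input: 7 features (User PAD, 3 dims + Vitality, 1 dim + Current PAD, 3 dims)
- Output: 3 predictions (ΔPAD: ΔPleasure, ΔArousal, ΔDominance)
- Model: multilayer perceptron (MLP) architecture
- Supported workflows: training, inference, evaluation, and performance benchmarking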

### Tech Stack
- Deep-learning framework: PyTorch
- Data processing: NumPy, Pandas
- Visualization: Matplotlib, Seaborn
- Configuration management: YAML
- Command-line interface: argparse

## Installation Guide

### System Requirements
- Python 3.8 or later
- CUDA support (optional, for GPU acceleration)

### Installation Steps

1. Clone the project
```bash
git clone <repository-url>
cd ann-playground
```

2. Create a virtual environment
```bash
python -m venv venv
source venv/bin/activate  # Linux/macOS
# or, on Windows:
venv\Scripts\activate
```

3. Install the dependencies
```bash
pip install -r requirements.txt
```

4. Verify the installation
```bash
python -c "import torch; print('PyTorch version:', torch.__version__)"
python -c "from src.models.pad_predictor import PADPredictor; print('Model import successful')"
```

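### Dependencies

Core dependencies:
- `torch`: deep-learning framework
- `numpy`: numerical computing
- `pandas`: data processing
- `matplotlib`, `seaborn`: data visualization
- `scikit-learn`: machine-learning utilities
- `loguru`: logging
- `pyyaml`: configuration file parsing
- `scipy`: scientific computing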

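## Quick Start

### 1. Run the Quick-Start Script

The simplest way to get started is to run the quick-start script:

```bash
cd examples
python quick_start.py
```

The script automatically:
- generates synthetic training data
- trains a baseline model
- runs inference
- explains the prediction results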

### 2. Use a Pre-Trained Model

If you already have a pre-trained model file, you can run inference directly:

```python
from src.utils.inference_engine import create_inference_engine

# Create the inference engine
engine = create_inference_engine(
    model_path="path/to/model.pth",
    preprocessor_path="path/to/preprocessor.pkl"
)

# Run a prediction on one 7-value feature vector
input_data = [0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1]
result = engine.predict(input_data)
print(result)
```

### 3. Use the Command-Line Tools

The project ships with a full set of command-line tools:

```bash
# Train a model
python -m src.cli.main train --config configs/training_config.yaml

# Run a prediction
python -m src.cli.main predict --model model.pth --quick 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1

# Evaluate a model
python -m src.cli.main evaluate --model model.pth --data test_data.csv

# Run the inference script
python -m src.cli.main inference --model model.pth --input-cli 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1

# Run a performance benchmark
python -m src.cli.main benchmark --model model.pth --num-samples 1000
```

## Data Preparation

### Data Format

#### Input features (7 dims)
| Feature | Type | Range | Description |
|--------|------|------|------|
| user_pleasure | float | [-1, 1] | User pleasure |
| user_arousal | float | [-1, 1] | User arousal |
| user_dominance | float | [-1, 1] | User dominance |
| vitality | float | [0, 100] | Vitality level |
| current_pleasure | float | [-1, 1] | Current pleasure |
| current_arousal | float | [-1, 1] | Current arousal |
| current_dominance | float | [-1, 1] | Current dominance |
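
The ranges in the table above can be checked before data reaches the model. The following is a minimal sketch of such a check; `FEATURE_RANGES` and `validate_features` are hypothetical names used for illustration and are not part of the project's API.

```python
# Hypothetical helper: the names below are illustrative, not part of the project.
FEATURE_RANGES = [
    ("user_pleasure", -1.0, 1.0),
    ("user_arousal", -1.0, 1.0),
    ("user_dominance", -1.0, 1.0),
    ("vitality", 0.0, 100.0),
    ("current_pleasure", -1.0, 1.0),
    ("current_arousal", -1.0, 1.0),
    ("current_dominance", -1.0, 1.0),
]


def validate_features(values):
    """Return a list of problems found in one 7-value feature vector."""
    if len(values) != len(FEATURE_RANGES):
        return [f"expected {len(FEATURE_RANGES)} values, got {len(values)}"]
    problems = []
    for (name, low, high), value in zip(FEATURE_RANGES, values):
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    return problems


# A valid vector produces an empty list; an out-of-range vitality is reported.
print(validate_features([0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1]))
print(validate_features([0.5, 0.3, -0.2, 175.0, 0.1, 0.4, -0.1]))
```

Returning a list of messages rather than raising on the first problem makes it easier to report every bad column of a row at once.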

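#### Output labels

The model predicts the three ΔPAD values (`output_dim=3`). The sample data files and inference results in this tutorial also carry `delta_pressure` and `confidence`.

| Label | Type | Range | Description |
|--------|------|------|------|
| delta_pleasure | float | [-0.5, 0.5] | Change in pleasure |
| delta_arousal | float | [-0.5, 0.5] | Change in arousal |
| delta_dominance | float | [-0.5, 0.5] | Change in dominance |
| delta_pressure | float | [-0.3, 0.3] | Change in pressure |
| confidence | float | [0, 1] | Prediction confidence |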
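
Because the labels are changes rather than absolute values, a caller will usually want to combine them with the current PAD state. The sketch below assumes the next state is simply the current state plus the predicted delta, clipped to the documented [-1, 1] range. The source does not specify this update rule, and `apply_delta` is a hypothetical helper.

```python
def apply_delta(current_pad, delta_pad, low=-1.0, high=1.0):
    """Add a predicted delta to the current PAD values and clip to [low, high].

    Assumption: additive update with clipping; the project may do this differently.
    """
    return [min(high, max(low, c + d)) for c, d in zip(current_pad, delta_pad)]


current_pad = [0.1, 0.4, -0.1]    # current_pleasure, current_arousal, current_dominance
delta_pad = [-0.05, 0.02, 0.03]   # delta_pleasure, delta_arousal, delta_dominance
next_pad = apply_delta(current_pad, delta_pad)
print([round(v, 2) for v in next_pad])
```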

### Data File Formats

#### CSV
```csv
user_pleasure,user_arousal,user_dominance,vitality,current_pleasure,current_arousal,current_dominance,delta_pleasure,delta_arousal,delta_dominance,delta_pressure,confidence
0.5,0.3,-0.2,80.0,0.1,0.4,-0.1,-0.05,0.02,0.03,-0.02,0.85
-0.3,0.6,0.2,45.0,-0.1,0.7,0.1,0.08,-0.03,-0.01,0.05,0.72
...
```
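
A file in this layout can be split into feature rows and ΔPAD label rows with the standard library alone. This is a sketch only: `load_rows` is a hypothetical helper, and it assumes the three ΔPAD columns are the training targets, as the `output_dim=3` setting suggests.

```python
import csv
import io

FEATURE_COLUMNS = [
    "user_pleasure", "user_arousal", "user_dominance", "vitality",
    "current_pleasure", "current_arousal", "current_dominance",
]
# Assumption: only the three ΔPAD columns are used as model targets.
LABEL_COLUMNS = ["delta_pleasure", "delta_arousal", "delta_dominance"]

SAMPLE_CSV = """user_pleasure,user_arousal,user_dominance,vitality,current_pleasure,current_arousal,current_dominance,delta_pleasure,delta_arousal,delta_dominance,delta_pressure,confidence
0.5,0.3,-0.2,80.0,0.1,0.4,-0.1,-0.05,0.02,0.03,-0.02,0.85
-0.3,0.6,0.2,45.0,-0.1,0.7,0.1,0.08,-0.03,-0.01,0.05,0.72
"""


def load_rows(text):
    """Parse CSV text into (features, labels) lists of float rows."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        features.append([float(row[name]) for name in FEATURE_COLUMNS])
        labels.append([float(row[name]) for name in LABEL_COLUMNS])
    return features, labels


features, labels = load_rows(SAMPLE_CSV)
print(len(features), len(features[0]), len(labels[0]))
```

Selecting columns by name rather than by position keeps the loader working if the column order in a file changes.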

#### JSON
```json
[
  {
    "user_pleasure": 0.5,
    "user_arousal": 0.3,
    "user_dominance": -0.2,
    "vitality": 80.0,
    "current_pleasure": 0.1,
    "current_arousal": 0.4,
    "current_dominance": -0.1,
    "delta_pleasure": -0.05,
    "delta_arousal": 0.02,
    "delta_dominance": 0.03,
    "delta_pressure": -0.02,
    "confidence": 0.85
  },
  ...
]
```

### Synthetic Data Generation

The project includes a synthetic data generator:

```python
from src.data.synthetic_generator import SyntheticDataGenerator

# Create the data generator
generator = SyntheticDataGenerator(num_samples=1000, seed=42)

# Generate the data
features, labels = generator.generate_data()

# Save the data
generator.save_data(features, labels, "output_data.csv", format='csv')
```

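### Data Preprocessing

```python
from src.data.preprocessor import DataPreprocessor

# Create the preprocessor
preprocessor = DataPreprocessor()

# Fit the preprocessor on the training split only
# (train_features and train_labels are the training portion of your data)
preprocessor.fit(train_features, train_labels)

# Transform the data
processed_features, processed_labels = preprocessor.transform(features, labels)

# Save the preprocessor so inference can reuse the same scaling
preprocessor.save("preprocessor.pkl")
```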
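
To make the fit/transform split concrete, here is a small standard-library sketch of z-score scaling, which is what a `"standard"` scaler conventionally means. It is not the project's `DataPreprocessor`; the function names are hypothetical, and the point is only that statistics are learned from training rows and then reused unchanged.

```python
import statistics


def fit_standard_scaler(rows):
    """Learn per-column mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns
    # so that transform never divides by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply z-score scaling with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train_rows = [[0.0, 50.0], [1.0, 70.0], [2.0, 90.0]]
means, stds = fit_standard_scaler(train_rows)
scaled = transform(train_rows, means, stds)
print(means, scaled[1])
```

Fitting on the training split and reusing `means` and `stds` for validation, test, and inference data avoids leaking information from held-out samples into the scaler.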

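## Model Training

### Basic Training

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

from src.models.pad_predictor import PADPredictor
from src.utils.trainer import ModelTrainer

# Create the model
model = PADPredictor(
    input_dim=7,
    output_dim=3,
    hidden_dims=[128, 64, 32],
    dropout_rate=0.3
)

# Build the dataset (processed_* come from the preprocessing step)
dataset = TensorDataset(
    torch.FloatTensor(processed_features),
    torch.FloatTensor(processed_labels)
)

# Hold out 10% of the samples for validation
val_size = len(dataset) // 10
train_set, val_set = random_split(dataset, [len(dataset) - val_size, val_size])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

# Create the trainer
trainer = ModelTrainer(model, preprocessor)

# Training configuration
config = {
    'epochs': 100,
    'learning_rate': 0.001,
    'weight_decay': 1e-4,
    'patience': 10,
    'save_dir': './models'
}

# Start training
history = trainer.train(train_loader, val_loader, config)
```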

### Training with a Configuration File

Create a training configuration file named `my_training_config.yaml`:

```yaml
training:
  epochs: 100
  learning_rate: 0.001
  weight_decay: 0.0001
  batch_size: 32

optimizer:
  type: "Adam"
  lr: 0.001
  weight_decay: 0.0001

scheduler:
  type: "ReduceLROnPlateau"
  patience: 5
  factor: 0.5

early_stopping:
  patience: 10
  min_delta: 0.001

data:
  train_ratio: 0.8
  val_ratio: 0.1
  test_ratio: 0.1
  shuffle: True
  seed: 42
```

Run the training:

```bash
python -m src.cli.main train --config my_training_config.yaml
```
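
The `data` section above describes an 80/10/10 split with a fixed seed. The sketch below shows one plausible way such ratios could turn into index lists; `split_indices` is a hypothetical function, and the project's own splitting code may differ in rounding or shuffling details.

```python
import random


def split_indices(num_samples, train_ratio=0.8, val_ratio=0.1, seed=42, shuffle=True):
    """Split range(num_samples) into train/val/test index lists.

    The test split receives whatever remains after train and val,
    so no sample is lost to rounding.
    """
    indices = list(range(num_samples))
    if shuffle:
        # A dedicated Random instance keeps the split reproducible
        # without touching the global random state.
        random.Random(seed).shuffle(indices)
    train_end = int(num_samples * train_ratio)
    val_end = train_end + int(num_samples * val_ratio)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```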

### Training Monitoring

During training, the following are saved automatically:
- the best model checkpoint
- the training history
- validation metrics
- learning-rate changes

To visualize the training run:

```python
import matplotlib.pyplot as plt

# Plot the loss curves
plt.figure(figsize=(10, 6))
plt.plot(history['train_loss'], label='Training Loss')
plt.plot(history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
```
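
The learning-rate changes recorded during training come from the `ReduceLROnPlateau` scheduler named in the configuration. The following is a simplified pure-Python re-creation of the idea, not PyTorch's implementation: the rate is multiplied by `factor` once the validation loss has failed to improve for more than `patience` consecutive epochs. `plateau_schedule` is a hypothetical name.

```python
def plateau_schedule(val_losses, lr=0.001, patience=5, factor=0.5, min_lr=1e-6):
    """Return the learning rate in effect after each epoch's validation loss."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                # Plateau detected: shrink the rate, but never below min_lr.
                lr = max(lr * factor, min_lr)
                bad_epochs = 0
        history.append(lr)
    return history


# With patience=2, the third consecutive non-improving epoch halves the rate.
print(plateau_schedule([1.0, 0.9, 0.9, 0.9, 0.9, 0.9], patience=2))
```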

### Model Evaluation

```python
from src.models.metrics import RegressionMetrics

# Create the metrics calculator
metrics_calculator = RegressionMetrics()

# Compute the metrics from ground-truth labels and model predictions
metrics = metrics_calculator.calculate_all_metrics(
    true_labels, predictions
)

print(f"MSE: {metrics['mse']:.4f}")
print(f"MAE: {metrics['mae']:.4f}")
print(f"R²: {metrics['r2']:.4f}")
```
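
For reference, the three metrics printed above have simple definitions. The sketch below computes them for a single output dimension in plain Python; how `RegressionMetrics` aggregates across the three ΔPAD outputs is not stated in this tutorial, so this example does not try to reproduce it.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [0.1, -0.2, 0.3, 0.0]
y_pred = [0.1, -0.1, 0.2, 0.0]
print(f"MSE: {mse(y_true, y_pred):.4f}")
print(f"MAE: {mae(y_true, y_pred):.4f}")
print(f"R²: {r2(y_true, y_pred):.4f}")
```

Note that `r2` is undefined when all true values are identical, because `ss_tot` is then zero.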

## Model Inference

### Single-Sample Inference

```python
from src.utils.inference_engine import create_inference_engine

# Create the inference engine
engine = create_inference_engine(
    model_path="models/best_model.pth",
    preprocessor_path="models/preprocessor.pkl"
)

# Predict for a single sample
input_data = [0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1]
result = engine.predict(input_data)

print(f"ΔPAD: {result['delta_pad']}")
print(f"ΔPressure: {result['delta_pressure']}")
print(f"Confidence: {result['confidence']}")
```

### Batch Inference

```python
# Predict for several samples at once
batch_inputs = [
    [0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1],
    [-0.3, 0.6, 0.2, 45.0, -0.1, 0.7, 0.1],
    [0.8, -0.4, 0.6, 90.0, 0.7, -0.3, 0.5]
]

batch_results = engine.predict_batch(batch_inputs)

for i, result in enumerate(batch_results):
    print(f"Sample {i+1}: {result}")
```

### Inference from a File

```python
import pandas as pd

# Read the inputs from a CSV file
# (the file is expected to contain only the 7 feature columns)
input_df = pd.read_csv('input_data.csv')
results = engine.predict_batch(input_df.values.tolist())

# Save the results
output_df = pd.DataFrame(results)
output_df.to_csv('output_results.csv', index=False)
```

### Performance Optimization

```python
# Warm up the model so one-time initialization cost does not slow later calls
for _ in range(5):
    engine.predict([0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1])

# Run a performance benchmark
stats = engine.benchmark(num_samples=1000, batch_size=32)
print(f"Throughput: {stats['throughput']:.2f} samples/sec")
print(f"Average latency: {stats['avg_latency']:.2f}ms")
```
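
The throughput and latency figures can be understood with a small timing loop. This sketch is not `engine.benchmark`: it times a stand-in function, `dummy_predict`, and the helper name `benchmark` and its return keys are chosen here only to mirror the output above.

```python
import time


def benchmark(predict_fn, sample, num_samples=1000, warmup=5):
    """Time repeated calls to predict_fn and report throughput and mean latency."""
    for _ in range(warmup):
        predict_fn(sample)  # warm-up calls are excluded from the measurement
    start = time.perf_counter()
    for _ in range(num_samples):
        predict_fn(sample)
    elapsed = max(time.perf_counter() - start, 1e-12)  # guard against a zero reading
    return {
        "throughput": num_samples / elapsed,            # samples per second
        "avg_latency": elapsed / num_samples * 1000.0,  # milliseconds per sample
    }


def dummy_predict(features):
    """Stand-in for engine.predict; returns three fake delta values."""
    return [0.1 * v for v in features[:3]]


stats = benchmark(dummy_predict, [0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1])
print(f"Throughput: {stats['throughput']:.2f} samples/sec")
print(f"Average latency: {stats['avg_latency']:.4f}ms")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.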

## Configuration Files

### Model Configuration (`configs/model_config.yaml`)

```yaml
# Basic model information
model_info:
  name: "MLP_Emotion_Predictor"
  type: "MLP"
  version: "1.0"

# Input and output dimensions
dimensions:
  input_dim: 7
  output_dim: 3

# Network architecture
architecture:
  hidden_layers:
    - size: 128
      activation: "ReLU"
      dropout: 0.2
    - size: 64
      activation: "ReLU"
      dropout: 0.2
    - size: 32
      activation: "ReLU"
      dropout: 0.1

  output_layer:
    activation: "Linear"

# Initialization
initialization:
  weight_init: "xavier_uniform"
  bias_init: "zeros"

# Regularization
regularization:
  weight_decay: 0.0001
  dropout_config:
    type: "standard"
    rate: 0.2
```
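
The size of the network described by this configuration is easy to work out by hand. The sketch below counts trainable parameters, assuming each layer is a plain fully connected layer with a bias term and that there are no extra parameterized layers such as batch normalization; `count_mlp_parameters` is a hypothetical helper.

```python
def count_mlp_parameters(input_dim, hidden_dims, output_dim):
    """Count weights and biases of a fully connected network.

    Each layer contributes n_in * n_out weights plus n_out biases.
    """
    sizes = [input_dim, *hidden_dims, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# 7 -> 128 -> 64 -> 32 -> 3
# (7*128+128) + (128*64+64) + (64*32+32) + (32*3+3) = 1024 + 8256 + 2080 + 99
print(count_mlp_parameters(7, [128, 64, 32], 3))  # 11459
```

Under these assumptions the model has 11,459 parameters, which is small enough to train comfortably on a CPU.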

### Training Configuration (`configs/training_config.yaml`)

```yaml
# Training parameters
training:
  epochs: 100
  learning_rate: 0.001
  weight_decay: 0.0001
  batch_size: 32
  seed: 42

# Optimizer
optimizer:
  type: "Adam"
  lr: 0.001
  weight_decay: 0.0001
  betas: [0.9, 0.999]

# Learning-rate scheduler
scheduler:
  type: "ReduceLROnPlateau"
  patience: 5
  factor: 0.5
  min_lr: 1.0e-6  # written with a decimal point; PyYAML loads a bare 1e-6 as a string

# Early stopping
early_stopping:
  patience: 10
  min_delta: 0.001
  monitor: "val_loss"

# Data
data:
  train_ratio: 0.8
  val_ratio: 0.1
  test_ratio: 0.1
  shuffle: True
  num_workers: 4

# Saving
saving:
  save_dir: "./outputs"
  save_best_only: True
  checkpoint_interval: 10
```
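
The `early_stopping` section combines `patience` and `min_delta`. The following is a minimal sketch of how those two settings typically interact, assuming an improvement only counts when the monitored loss drops by more than `min_delta`. The `EarlyStopping` class here is hypothetical and is not the project's trainer code.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            # Changes smaller than min_delta do not count as improvement.
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=3, min_delta=0.001)
losses = [0.50, 0.40, 0.3995, 0.3999, 0.3996, 0.30]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In this run the three losses after 0.40 improve by less than `min_delta`, so training stops before the later drop to 0.30 is ever seen. That is the trade-off a small `patience` makes.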

### Data Configuration (`configs/data_config.yaml`)

```yaml
# Data paths
paths:
  train_data: "data/train.csv"
  val_data: "data/val.csv"
  test_data: "data/test.csv"

# Preprocessing
preprocessing:
  normalize_features: True
  normalize_labels: True
  feature_scaler: "standard"  # standard, minmax, robust
  label_scaler: "standard"

# Data augmentation
augmentation:
  enabled: False
  noise_std: 0.01
  augmentation_factor: 2

# Synthetic data
synthetic_data:
  num_samples: 1000
  seed: 42
  add_noise: True
  add_correlations: True
```
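
The `augmentation` settings are not explained further in this tutorial. One plausible reading, sketched below purely as an assumption, is that `augmentation_factor: 2` doubles the dataset by appending one noisy copy of every row, with Gaussian noise of standard deviation `noise_std` applied to already normalized features. `augment_with_noise` is a hypothetical function.

```python
import random


def augment_with_noise(rows, noise_std=0.01, augmentation_factor=2, seed=42):
    """Return the original rows plus (augmentation_factor - 1) noisy copies.

    Assumption: the factor is the final dataset size relative to the original.
    """
    rng = random.Random(seed)
    augmented = [list(row) for row in rows]
    for _ in range(augmentation_factor - 1):
        for row in rows:
            augmented.append([v + rng.gauss(0.0, noise_std) for v in row])
    return augmented


rows = [[0.5, 0.3], [-0.3, 0.6]]
augmented = augment_with_noise(rows)
print(len(rows), "->", len(augmented))
```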

## Command-Line Tools

### Training Commands

```bash
# Basic training
python -m src.cli.main train --config configs/training_config.yaml

# Specify the output directory
python -m src.cli.main train --config configs/training_config.yaml --output-dir ./my_models

# Train on a GPU
python -m src.cli.main train --config configs/training_config.yaml --device cuda

# Resume training from a checkpoint
python -m src.cli.main train --config configs/training_config.yaml --resume checkpoints/epoch_50.pth

# Override configuration parameters
python -m src.cli.main train --config configs/training_config.yaml --epochs 200 --batch-size 64 --learning-rate 0.0005
```
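
The last command overrides values from the configuration file on the command line. The sketch below illustrates the general pattern with `argparse`, which the project lists as its CLI library. The parser and the `apply_overrides` helper are hypothetical and only model the three flags shown above; the real CLI may merge settings differently.

```python
import argparse


def build_parser():
    """Parser for the three override flags; unset flags default to None."""
    parser = argparse.ArgumentParser(prog="train")
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--batch-size", type=int)
    parser.add_argument("--learning-rate", type=float)
    return parser


def apply_overrides(config, args):
    """Return a copy of config where flags given on the CLI replace file values."""
    merged = {**config, "training": dict(config["training"])}
    for key in ("epochs", "batch_size", "learning_rate"):
        value = getattr(args, key)
        if value is not None:
            merged["training"][key] = value
    return merged


config = {"training": {"epochs": 100, "batch_size": 32, "learning_rate": 0.001}}
args = build_parser().parse_args(["--epochs", "200", "--batch-size", "64"])
merged = apply_overrides(config, args)
print(merged["training"])
```

Leaving the argparse defaults at `None` is what lets the merge step tell "flag not given" apart from "flag given with the same value as the file".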

### Prediction Commands

```bash
# Interactive prediction
python -m src.cli.main predict --model model.pth --interactive

# Quick prediction
python -m src.cli.main predict --model model.pth --quick 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1

# Batch prediction
python -m src.cli.main predict --model model.pth --batch input.csv --output results.csv

# Specify the preprocessor
python -m src.cli.main predict --model model.pth --preprocessor preprocessor.pkl --quick 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1
```

### Evaluation Commands

```bash
# Basic evaluation
python -m src.cli.main evaluate --model model.pth --data test_data.csv

# Generate a detailed report
python -m src.cli.main evaluate --model model.pth --data test_data.csv --report evaluation_report.html

# Choose the evaluation metrics
python -m src.cli.main evaluate --model model.pth --data test_data.csv --metrics mse mae r2

# Use a custom batch size
python -m src.cli.main evaluate --model model.pth --data test_data.csv --batch-size 64
```

### Inference Commands

```bash
# Inference from command-line input
python -m src.cli.main inference --model model.pth --input-cli 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1

# Inference from a JSON file
python -m src.cli.main inference --model model.pth --input-json input.json --output-json output.json

# Inference from a CSV file
python -m src.cli.main inference --model model.pth --input-csv input.csv --output-csv output.csv

# Benchmark
python -m src.cli.main inference --model model.pth --benchmark --num-samples 1000

# Quiet mode
python -m src.cli.main inference --model model.pth --input-cli 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1 --quiet
```

### Benchmark Commands

```bash
# Standard benchmark
python -m src.cli.main benchmark --model model.pth

# Custom benchmark parameters
python -m src.cli.main benchmark --model model.pth --num-samples 5000 --batch-size 64

# Generate a performance report
python -m src.cli.main benchmark --model model.pth --report performance_report.json

# Verbose output
python -m src.cli.main benchmark --model model.pth --verbose
```

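## FAQ

### Q1: How do I handle missing values?
A: The project does not currently handle missing values. Deal with them yourself during preprocessing, using one of the following:
```python
# Option 1: drop rows that contain missing values
df = df.dropna()

# Option 2: fill missing values with the column mean
df = df.fillna(df.mean(numeric_only=True))

# Option 3: fill missing values with 0
df = df.fillna(0)
```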

### Q2: How do I customize the model architecture?
A: There are two ways to customize the architecture.

Option 1: edit the configuration file
```yaml
# Edit in model_config.yaml
architecture:
  hidden_layers:
    - size: 256  # more neurons
      activation: "ReLU"
      dropout: 0.3
    - size: 128
      activation: "ReLU"
      dropout: 0.2
    - size: 64
      activation: "ReLU"
      dropout: 0.1
```

Option 2: construct the model directly
```python
from src.models.pad_predictor import PADPredictor

model = PADPredictor(
    input_dim=7,
    output_dim=3,
    hidden_dims=[256, 128, 64, 32],  # custom hidden layers
    dropout_rate=0.3
)
```

### Q3: How do I handle categorical features?
A: The current version supports numeric features only. Categorical features must be encoded first:
```python
import pandas as pd

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['category_column'])

# Or label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category_column'])
```

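### Q4: How do I improve model performance?
A: Try the following:
1. Add more training data
2. Adjust the architecture (more layers or more neurons)
3. Tune the hyperparameters (learning rate, batch size, and so on)
4. Augment the data (add noise or synthetic samples)
5. Regularize (adjust dropout and weight_decay)
6. Use early stopping to prevent overfitting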

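### Q5: How do I deploy the model to production?
A: The recommended approach is:
1. Save the model and the preprocessor
2. Create an inference service
3. Wrap it with FastAPI or Flask
4. Deploy it in a container

Example:
```python
from typing import List

from fastapi import FastAPI
from src.utils.inference_engine import create_inference_engine

app = FastAPI()
engine = create_inference_engine(
    model_path="model.pth",
    preprocessor_path="preprocessor.pkl"
)

@app.post("/predict")
async def predict(input_data: List[float]):
    # The request body is a JSON array holding the 7 feature values
    result = engine.predict(input_data)
    return result
```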

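### Q6: How do I handle large datasets?
A: For large datasets:
1. Use a data generator (DataGenerator)
2. Process the data in batches
3. Load data with multiple worker processes
4. Consider distributed training

```python
from torch.utils.data import DataLoader

# Load data with multiple worker processes
train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,    # worker processes
    pin_memory=True   # page-locked host memory speeds up copies to the GPU
)
```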

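### Q7: How do I visualize the predictions?
A: Predictions can be visualized in several ways, for example:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of predictions against true values
# (true_labels and predictions are NumPy arrays of the same shape)
plt.scatter(true_labels, predictions, alpha=0.6)
min_val = min(true_labels.min(), predictions.min())
max_val = max(true_labels.max(), predictions.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--')  # ideal y = x line
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.show()

# Residual histogram
residuals = true_labels - predictions
plt.hist(residuals, bins=30)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
```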

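### Q8: How do I manage model versions?
A: A suggested versioning strategy:
1. Use semantic version numbers
2. Save the training configuration and hyperparameters
3. Record the model's performance metrics
4. Use a model registry

```python
from datetime import datetime

import torch

# Save the model metadata alongside the weights
model_info = {
    'version': '1.2.0',
    'architecture': str(model),
    'training_config': config,
    'performance': metrics,
    'created_at': datetime.now().isoformat()
}

torch.save({
    'model_state_dict': model.state_dict(),
    'model_info': model_info
}, f'model_v{model_info["version"]}.pth')
```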

## Troubleshooting

### Common Errors and Solutions

#### 1. CUDA out of memory
```
RuntimeError: CUDA out of memory
```
Solutions:
- Reduce the batch size
- Train on the CPU: `--device cpu`
- Clear the GPU cache: `torch.cuda.empty_cache()`

#### 2. Model fails to load
```
FileNotFoundError: [Errno 2] No such file or directory: 'model.pth'
```
Solutions:
- Check that the file path is correct
- Make sure the model file exists
- Check the file permissions

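#### 3. Data dimension mismatch
```
RuntimeError: mat1 and mat2 shapes cannot be multiplied
```
Solutions:
- Check the input dimension (it should be 7)
- Make sure the data was preprocessed correctly
- Verify the model configuration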

#### 4. Import error
```
ModuleNotFoundError: No module named 'src.xxx'
```
Solutions:
- Check the Python path settings
- Make sure you are running from the project root
- Reinstall the dependencies

#### 5. Configuration file error
```
yaml.scanner.ScannerError: while scanning for the next token
```
Solutions:
- Check the YAML syntax
- Make sure the indentation is correct
- Run the file through a YAML validator

### Debugging Tips

#### 1. Enable verbose logging
```bash
python -m src.cli.main train --config configs/training_config.yaml --verbose --log-level DEBUG
```

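#### 2. Test with a small dataset
```python
from src.data.synthetic_generator import SyntheticDataGenerator

# Use a small amount of data for a quick test
generator = SyntheticDataGenerator(num_samples=100, seed=42)
features, labels = generator.generate_data()
```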

#### 3. Inspect the model output
```python
import torch

# Check the shape and range of the model output
model.eval()
with torch.no_grad():
    sample_input = torch.randn(1, 7)
    output = model(sample_input)
    print(f"Output shape: {output.shape}")
    print(f"Output range: [{output.min():.3f}, {output.max():.3f}]")
```

#### 4. Verify the preprocessing
```python
# Inspect the preprocessed data
print(f"Features mean: {processed_features.mean(axis=0)}")
print(f"Features std: {processed_features.std(axis=0)}")
print(f"Labels mean: {processed_labels.mean(axis=0)}")
print(f"Labels std: {processed_labels.std(axis=0)}")
```

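### Performance Recommendations

#### 1. Training
- Choose an appropriate batch size
- Enable automatic mixed precision (AMP) training
- Use a learning-rate scheduler
- Use early stopping

#### 2. Inference
- Warm up the model
- Run inference in batches
- Quantize the model
- Export to ONNX

#### 3. Memory
- Use a data generator
- Release variables as soon as they are no longer needed
- Use gradient accumulation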
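
Gradient accumulation saves memory because several small micro-batches stand in for one large batch. The sketch below uses a one-parameter linear model in plain Python to show why this works: when equally sized micro-batch gradients are each scaled by `1 / accumulation_steps` and summed, the result matches the full-batch gradient. All names are illustrative, and the toy model is an assumption made for the demonstration.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Gradient computed on the whole batch at once.
full_batch_grad = mse_gradient(w, xs, ys)

# The same gradient built up from two micro-batches of two samples each.
accumulation_steps = 2
micro_size = len(xs) // accumulation_steps
accumulated = 0.0
for step in range(accumulation_steps):
    lo, hi = step * micro_size, (step + 1) * micro_size
    accumulated += mse_gradient(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_batch_grad, accumulated)
```

In a PyTorch training loop the same idea is usually expressed by dividing each micro-batch loss by the number of accumulation steps and calling the optimizer step only after the last micro-batch.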

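---

## Contact

If you have other questions or need help, get in touch through:
- Project repository: [GitHub repository link]
- Issue tracker: [Issues link]
- Documentation: [documentation link]
- Email: [contact email]

---

Note: This tutorial describes the current version of the project. Some details may change as the project evolves, so check the latest documentation.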