# Emotion and Physiological State Change Prediction Model - Complete Tutorial

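## Table of Contents

1. [Project Overview](#project-overview)
2. [Installation Guide](#installation-guide)
3. [Quick Start](#quick-start)
4. [Data Preparation](#data-preparation)
5. [Model Training](#model-training)
6. [Model Inference](#model-inference)
7. [Configuration Files](#configuration-files)
8. [Command-Line Tools](#command-line-tools)
9. [FAQ](#faq)
10. [Troubleshooting](#troubleshooting)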

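## Project Overview

This project is a deep learning model that predicts changes in a user's emotional and physiological state using a multilayer perceptron (MLP).

### Core Features
- **Input**: 7-dimensional features (User PAD, 3 dims + Vitality, 1 dim + Current PAD, 3 dims)
- **Output**: 3-dimensional prediction (ΔPAD: ΔPleasure, ΔArousal, ΔDominance)
- **Model**: multilayer perceptron (MLP) architecture
- **Supported workflows**: training, inference, evaluation, and performance benchmarking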
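
The tutorial does not say how a predicted ΔPAD is consumed downstream. As a minimal sketch, assuming the change is simply added to the Current PAD values and clamped to the [-1, 1] range used for PAD features, a hypothetical helper `apply_delta_pad` (not part of the project) could look like this:

```python
from typing import Sequence, Tuple


def apply_delta_pad(current_pad: Sequence[float],
                    delta_pad: Sequence[float]) -> Tuple[float, ...]:
    """Add a predicted change to the current PAD state, clamped to [-1, 1].

    Hypothetical helper: the additive update and the clamping are assumptions.
    """
    if len(current_pad) != 3 or len(delta_pad) != 3:
        raise ValueError("PAD vectors must have exactly 3 components")
    return tuple(max(-1.0, min(1.0, c + d)) for c, d in zip(current_pad, delta_pad))


# Current PAD and an example ΔPAD (pleasure, arousal, dominance)
next_pad = apply_delta_pad([0.1, 0.4, -0.1], [-0.05, 0.02, 0.03])
print(next_pad)
```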

### Tech Stack
- **Deep learning framework**: PyTorch
- **Data processing**: NumPy, Pandas
- **Visualization**: Matplotlib, Seaborn
- **Configuration management**: YAML
- **Command-line interface**: argparse

## Installation Guide

### System Requirements
- Python 3.8 or later
- CUDA support (optional, for GPU acceleration)

### Installation Steps

1. **Clone the project**
```bash
git clone <repository-url>
cd ann-playground
```

2. **Create a virtual environment**
```bash
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate  # Windows
```

3. **Install dependencies**
```bash
pip install -r requirements.txt
```

4. **Verify the installation**
```bash
python -c "import torch; print('PyTorch version:', torch.__version__)"
python -c "from src.models.pad_predictor import PADPredictor; print('Model import successful')"
```

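### Dependencies

Core dependencies:
- `torch`: deep learning framework
- `numpy`: numerical computing
- `pandas`: data processing
- `matplotlib`, `seaborn`: data visualization
- `scikit-learn`: machine learning utilities
- `loguru`: logging
- `pyyaml`: configuration file parsing
- `scipy`: scientific computing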

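## Quick Start

### 1. Run the quick start tutorial

The simplest way to get started is to run the quick start script:

```bash
cd examples
python quick_start.py
```

The script automatically:
- generates synthetic training data
- trains a baseline model
- runs inference
- explains the prediction results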

### 2. Use a pretrained model

If you already have a pretrained model file, you can run inference directly:

```python
from src.utils.inference_engine import create_inference_engine

# Create the inference engine
engine = create_inference_engine(
    model_path="path/to/model.pth",
    preprocessor_path="path/to/preprocessor.pkl"
)

# Predict from one 7-dimensional input:
# User PAD (3), Vitality (1), Current PAD (3)
input_data = [0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1]
result = engine.predict(input_data)
print(result)
```

### 3. Use the command-line tools

The project ships with a full set of command-line tools:

```bash
# Train a model
python -m src.cli.main train --config configs/training_config.yaml

# Run a prediction
python -m src.cli.main predict --model model.pth --quick 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1

# Evaluate a model
python -m src.cli.main evaluate --model model.pth --data test_data.csv

# Run the inference script
python -m src.cli.main inference --model model.pth --input-cli 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1

# Run a performance benchmark
python -m src.cli.main benchmark --model model.pth --num-samples 1000
```

## Data Preparation

### Data Format

#### Input features (7 dimensions)
| Feature | Type | Range | Description |
|--------|------|------|------|
| user_pleasure | float | [-1, 1] | User pleasure |
| user_arousal | float | [-1, 1] | User arousal |
| user_dominance | float | [-1, 1] | User dominance |
| vitality | float | [0, 100] | Vitality level |
| current_pleasure | float | [-1, 1] | Current pleasure |
| current_arousal | float | [-1, 1] | Current arousal |
| current_dominance | float | [-1, 1] | Current dominance |
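
Because the seven features must arrive in a fixed order and within the ranges listed above, it can help to check an input vector before handing it to the model. The following is a minimal sketch of such a check; `FEATURE_SPEC` and `validate_features` are hypothetical names, not part of the project:

```python
from typing import Sequence

# (name, lower bound, upper bound), in the order listed in the table above
FEATURE_SPEC = [
    ("user_pleasure", -1.0, 1.0),
    ("user_arousal", -1.0, 1.0),
    ("user_dominance", -1.0, 1.0),
    ("vitality", 0.0, 100.0),
    ("current_pleasure", -1.0, 1.0),
    ("current_arousal", -1.0, 1.0),
    ("current_dominance", -1.0, 1.0),
]


def validate_features(values: Sequence[float]) -> None:
    """Raise ValueError if the vector has the wrong length or an out-of-range value."""
    if len(values) != len(FEATURE_SPEC):
        raise ValueError(f"expected {len(FEATURE_SPEC)} features, got {len(values)}")
    for (name, low, high), value in zip(FEATURE_SPEC, values):
        # NaN fails this comparison as well, so it is rejected too
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")


validate_features([0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1])  # passes silently
```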

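#### Output labels (3 dimensions)

The project overview and the model configuration specify a 3-dimensional output, which corresponds to the three ΔPAD rows. The `delta_pressure` and `confidence` columns also appear in the sample data files.

| Label | Type | Range | Description |
|--------|------|------|------|
| delta_pleasure | float | [-0.5, 0.5] | Change in pleasure |
| delta_arousal | float | [-0.5, 0.5] | Change in arousal |
| delta_dominance | float | [-0.5, 0.5] | Change in dominance |
| delta_pressure | float | [-0.3, 0.3] | Change in pressure |
| confidence | float | [0, 1] | Prediction confidence |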

### Data File Formats

#### CSV format
```csv
user_pleasure,user_arousal,user_dominance,vitality,current_pleasure,current_arousal,current_dominance,delta_pleasure,delta_arousal,delta_dominance,delta_pressure,confidence
0.5,0.3,-0.2,80.0,0.1,0.4,-0.1,-0.05,0.02,0.03,-0.02,0.85
-0.3,0.6,0.2,45.0,-0.1,0.7,0.1,0.08,-0.03,-0.01,0.05,0.72
...
```
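
To illustrate how the columns map onto model inputs and targets, here is a minimal sketch that parses this CSV layout with the standard library. It assumes only the three ΔPAD columns are used as targets (matching the 3-dimensional output); `read_samples` is a hypothetical helper, and the project's own loader may behave differently:

```python
import csv
import io
from typing import List, TextIO, Tuple

FEATURE_COLUMNS = [
    "user_pleasure", "user_arousal", "user_dominance", "vitality",
    "current_pleasure", "current_arousal", "current_dominance",
]
# Assumption: only the ΔPAD columns are training targets
LABEL_COLUMNS = ["delta_pleasure", "delta_arousal", "delta_dominance"]


def read_samples(handle: TextIO) -> Tuple[List[List[float]], List[List[float]]]:
    """Split each CSV row into a 7-value feature list and a 3-value label list."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        features.append([float(row[name]) for name in FEATURE_COLUMNS])
        labels.append([float(row[name]) for name in LABEL_COLUMNS])
    return features, labels


# In-memory stand-in for a data file, using the two sample rows shown above
sample_csv = io.StringIO(
    "user_pleasure,user_arousal,user_dominance,vitality,current_pleasure,"
    "current_arousal,current_dominance,delta_pleasure,delta_arousal,"
    "delta_dominance,delta_pressure,confidence\n"
    "0.5,0.3,-0.2,80.0,0.1,0.4,-0.1,-0.05,0.02,0.03,-0.02,0.85\n"
    "-0.3,0.6,0.2,45.0,-0.1,0.7,0.1,0.08,-0.03,-0.01,0.05,0.72\n"
)
features, labels = read_samples(sample_csv)
print(len(features), len(features[0]), len(labels[0]))  # 2 7 3
```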

#### JSON format
```json
[
  {
    "user_pleasure": 0.5,
    "user_arousal": 0.3,
    "user_dominance": -0.2,
    "vitality": 80.0,
    "current_pleasure": 0.1,
    "current_arousal": 0.4,
    "current_dominance": -0.1,
    "delta_pleasure": -0.05,
    "delta_arousal": 0.02,
    "delta_dominance": 0.03,
    "delta_pressure": -0.02,
    "confidence": 0.85
  },
  ...
]
```

### Synthetic Data Generation

The project includes a synthetic data generator:

```python
from src.data.synthetic_generator import SyntheticDataGenerator

# Create the data generator
generator = SyntheticDataGenerator(num_samples=1000, seed=42)

# Generate data
features, labels = generator.generate_data()

# Save the data
generator.save_data(features, labels, "output_data.csv", format='csv')
```

### Data Preprocessing

```python
from src.data.preprocessor import DataPreprocessor

# Create the preprocessor
preprocessor = DataPreprocessor()

# Fit the preprocessor on the training split only
preprocessor.fit(train_features, train_labels)

# Transform the data
processed_features, processed_labels = preprocessor.transform(features, labels)

# Save the preprocessor so inference can reuse the same scaling
preprocessor.save("preprocessor.pkl")
```
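
The internals of `DataPreprocessor` are not shown in this tutorial. Since the data configuration lists `"standard"` as a scaler option, the sketch below illustrates what standard scaling does, under that assumption: statistics are fitted on the training rows and then applied to any rows. This matters here because `vitality` spans [0, 100] while the PAD features span [-1, 1]. The function names are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def fit_standard_scaler(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Compute the per-column mean and standard deviation of the training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows: Sequence[Sequence[float]],
              means: Sequence[float],
              stds: Sequence[float]) -> List[List[float]]:
    """Rescale each column to zero mean and unit variance using fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train_rows = [
    [0.5, 0.3, -0.2, 80.0, 0.1, 0.4, -0.1],
    [-0.3, 0.6, 0.2, 45.0, -0.1, 0.7, 0.1],
    [0.8, -0.4, 0.6, 90.0, 0.7, -0.3, 0.5],
]
means, stds = fit_standard_scaler(train_rows)
scaled = transform(train_rows, means, stds)
print([round(v, 3) for v in scaled[0]])
```

After scaling, every column of the training data has mean 0 and standard deviation 1, so the much larger `vitality` values no longer dominate the others.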

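## Model Training

### Basic Training

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

from src.models.pad_predictor import PADPredictor
from src.utils.trainer import ModelTrainer

# Create the model
model = PADPredictor(
    input_dim=7,
    output_dim=3,
    hidden_dims=[128, 64, 32],
    dropout_rate=0.3
)

# Build the dataset from the preprocessed arrays (see "Data Preprocessing")
dataset = TensorDataset(
    torch.FloatTensor(processed_features),
    torch.FloatTensor(processed_labels)
)

# Hold out 10% of the samples for validation, then create the data loaders
val_size = int(0.1 * len(dataset))
train_set, val_set = random_split(dataset, [len(dataset) - val_size, val_size])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

# Create the trainer
trainer = ModelTrainer(model, preprocessor)

# Training configuration
config = {
    'epochs': 100,
    'learning_rate': 0.001,
    'weight_decay': 1e-4,
    'patience': 10,
    'save_dir': './models'
}

# Start training
history = trainer.train(train_loader, val_loader, config)
```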

### Training with a Configuration File

Create a training configuration file named `my_training_config.yaml`:

```yaml
training:
  epochs: 100
  learning_rate: 0.001
  weight_decay: 0.0001
  batch_size: 32
  
optimizer:
  type: "Adam"
  lr: 0.001
  weight_decay: 0.0001

scheduler:
  type: "ReduceLROnPlateau"
  patience: 5
  factor: 0.5
  
early_stopping:
  patience: 10
  min_delta: 0.001

data:
  train_ratio: 0.8
  val_ratio: 0.1
  test_ratio: 0.1
  shuffle: True
  seed: 42
```

Run the training:

```bash
python -m src.cli.main train --config my_training_config.yaml
```
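
The `data` section of the configuration above describes an 80/10/10 split with shuffling and a fixed seed. How the project performs the split is not shown, so the following is only a sketch of one way to realize those settings; `split_indices` is a hypothetical helper:

```python
import random
from typing import List, Tuple


def split_indices(num_samples: int,
                  train_ratio: float = 0.8,
                  val_ratio: float = 0.1,
                  seed: int = 42,
                  shuffle: bool = True) -> Tuple[List[int], List[int], List[int]]:
    """Return train/val/test index lists; the test split receives the remainder."""
    indices = list(range(num_samples))
    if shuffle:
        # A dedicated Random instance keeps the split reproducible
        random.Random(seed).shuffle(indices)
    train_end = int(num_samples * train_ratio)
    val_end = train_end + int(num_samples * val_ratio)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

Fixing the seed means the same samples land in the same split on every run, which keeps validation metrics comparable across experiments.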

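### Training Monitoring

During training, the following are saved automatically:
- the best model checkpoint
- the training history
- validation metrics
- learning rate changes

To visualize the training process:

```python
import matplotlib.pyplot as plt

# Plot the loss curves from the `history` returned by trainer.train()
plt.figure(figsize=(10, 6))
plt.plot(history['train_loss'], label='Training Loss')
plt.plot(history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
```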

### Model Evaluation

```python
from src.models.metrics import RegressionMetrics

# Create the metrics calculator
metrics_calculator = RegressionMetrics()

# Compute the metrics
metrics = metrics_calculator.calculate_all_metrics(
    true_labels, predictions
)

print(f"MSE: {metrics['mse']:.4f}")
print(f"MAE: {metrics['mae']:.4f}")
print(f"R²: {metrics['r2']:.4f}")
```
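
For readers unfamiliar with the three metrics printed above, the sketch below computes them from their textbook definitions for a single output dimension. It is not the implementation of `RegressionMetrics`; in particular, how that class aggregates across the three ΔPAD outputs is not specified here.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values for one output dimension (e.g. delta_pleasure)
y_true = [0.0, 0.1, 0.2, 0.3]
y_pred = [0.0, 0.1, 0.2, 0.5]
print(f"MSE: {mse(y_true, y_pred):.4f}")  # MSE: 0.0100
print(f"MAE: {mae(y_true, y_pred):.4f}")  # MAE: 0.0500
print(f"R²: {r2(y_true, y_pred):.4f}")    # R²: 0.2000
```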

## Model Inference

### Single-Sample Inference

```python
from src.utils.inference_engine import create_inference_engine

# Create the inference engine
engine = create_inference_engine(
    model_path="models/best_model.pth",
    preprocessor_path="models/preprocessor.pkl"
)

# Predict for a single sample
input_data = [0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1]
result = engine.predict(input_data)

print(f"ΔPAD: {result['delta_pad']}")
print(f"ΔPressure: {result['delta_pressure']}")
print(f"Confidence: {result['confidence']}")
```

### Batch Inference

```python
# Predict for a batch of samples
batch_inputs = [
    [0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1],
    [-0.3, 0.6, 0.2, 45.0, -0.1, 0.7, 0.1],
    [0.8, -0.4, 0.6, 90.0, 0.7, -0.3, 0.5]
]

batch_results = engine.predict_batch(batch_inputs)

for i, result in enumerate(batch_results):
    print(f"Sample {i+1}: {result}")
```

### Inference from a File

```python
import pandas as pd

# Read the inputs from a CSV file
# (input_data.csv is assumed to contain only the 7 feature columns)
input_df = pd.read_csv('input_data.csv')
results = engine.predict_batch(input_df.values.tolist())

# Save the results
output_df = pd.DataFrame(results)
output_df.to_csv('output_results.csv', index=False)
```

### Performance Optimization

```python
# Warm up the model so later calls do not pay one-time initialization costs
for _ in range(5):
    engine.predict([0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1])

# Run a performance benchmark
stats = engine.benchmark(num_samples=1000, batch_size=32)
print(f"Throughput: {stats['throughput']:.2f} samples/sec")
print(f"Average latency: {stats['avg_latency']:.2f}ms")
```
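
How `engine.benchmark` measures these numbers is not documented here. As a rough sketch of how throughput and average latency can be measured, the hypothetical function below times any batch prediction callable with `time.perf_counter`. It assumes latency is reported per batch, in milliseconds, and uses a dummy predictor in place of a real model.

```python
import time
from typing import Callable, Dict, List


def benchmark(predict_batch: Callable[[List[List[float]]], object],
              num_samples: int = 1000,
              batch_size: int = 32) -> Dict[str, float]:
    """Time `predict_batch` over `num_samples` inputs split into batches."""
    sample = [0.5, 0.3, -0.2, 75.0, 0.1, 0.4, -0.1]
    latencies_ms = []
    start = time.perf_counter()
    for offset in range(0, num_samples, batch_size):
        # The last batch may be smaller than batch_size
        batch = [sample] * min(batch_size, num_samples - offset)
        batch_start = time.perf_counter()
        predict_batch(batch)
        latencies_ms.append((time.perf_counter() - batch_start) * 1000.0)
    elapsed = max(time.perf_counter() - start, 1e-12)  # guard against division by zero
    return {
        "throughput": num_samples / elapsed,                     # samples per second
        "avg_latency": sum(latencies_ms) / len(latencies_ms),    # ms per batch
    }


def dummy_predict_batch(batch: List[List[float]]) -> List[List[float]]:
    """Stand-in for a real model: returns a zero ΔPAD for each input row."""
    return [[0.0, 0.0, 0.0] for _ in batch]


stats = benchmark(dummy_predict_batch)
print(f"Throughput: {stats['throughput']:.2f} samples/sec")
print(f"Average latency: {stats['avg_latency']:.4f} ms per batch")
```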

## Configuration Files

### Model Configuration (`configs/model_config.yaml`)

```yaml
# Basic model information
model_info:
  name: "MLP_Emotion_Predictor"
  type: "MLP"
  version: "1.0"

# Input and output dimensions
dimensions:
  input_dim: 7
  output_dim: 3

# Network architecture parameters
architecture:
  hidden_layers:
    - size: 128
      activation: "ReLU"
      dropout: 0.2
    - size: 64
      activation: "ReLU"
      dropout: 0.2
    - size: 32
      activation: "ReLU"
      dropout: 0.1
  
  output_layer:
    activation: "Linear"

# Initialization parameters
initialization:
  weight_init: "xavier_uniform"
  bias_init: "zeros"

# Regularization parameters
regularization:
  weight_decay: 0.0001
  dropout_config:
    type: "standard"
    rate: 0.2
```
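
To get a feel for the size of the network this configuration describes, the sketch below counts trainable parameters. It assumes plain fully connected layers with biases and nothing else (for example, no normalization layers); the actual `PADPredictor` may differ.

```python
from typing import Sequence


def count_mlp_parameters(input_dim: int, hidden_dims: Sequence[int], output_dim: int) -> int:
    """Count weights and biases of a fully connected MLP."""
    sizes = [input_dim, *hidden_dims, output_dim]
    # Each layer has n_in * n_out weights plus n_out biases
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# 7 -> 128 -> 64 -> 32 -> 3, as in the configuration above
print(count_mlp_parameters(7, [128, 64, 32], 3))  # 11459
```

The per-layer breakdown is 1,024 + 8,256 + 2,080 + 99 parameters, so the model is small enough to train comfortably on a CPU.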

### Training Configuration (`configs/training_config.yaml`)

```yaml
# Training parameters
training:
  epochs: 100
  learning_rate: 0.001
  weight_decay: 0.0001
  batch_size: 32
  seed: 42

# Optimizer configuration
optimizer:
  type: "Adam"
  lr: 0.001
  weight_decay: 0.0001
  betas: [0.9, 0.999]

# Learning rate scheduler
scheduler:
  type: "ReduceLROnPlateau"
  patience: 5
  factor: 0.5
  min_lr: 1.0e-6  # decimal point required: PyYAML loads a bare 1e-6 as a string

# Early stopping configuration
early_stopping:
  patience: 10
  min_delta: 0.001
  monitor: "val_loss"

# Data configuration
data:
  train_ratio: 0.8
  val_ratio: 0.1
  test_ratio: 0.1
  shuffle: True
  num_workers: 4

# Saving configuration
saving:
  save_dir: "./outputs"
  save_best_only: True
  checkpoint_interval: 10
```
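
The `early_stopping` block above can be read as "stop once `val_loss` has failed to improve by at least `min_delta` for `patience` consecutive epochs". The class below is a minimal sketch of that common rule, not the project's trainer code; the class name and method are hypothetical.

```python
class EarlyStopping:
    """Signal a stop when the monitored loss stops improving."""

    def __init__(self, patience: int = 10, min_delta: float = 0.001) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            # Improvement large enough to count: reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Short patience so the example terminates quickly
stopper = EarlyStopping(patience=3, min_delta=0.001)
losses = [0.50, 0.40, 0.3995, 0.3999, 0.3998, 0.30]
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        print(f"stopping at epoch {epoch}")  # stopping at epoch 5
        break
```

Note that the drop from 0.40 to 0.3995 does not count as an improvement because it is smaller than `min_delta`.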

### Data Configuration (`configs/data_config.yaml`)

```yaml
# Data path configuration
paths:
  train_data: "data/train.csv"
  val_data: "data/val.csv"
  test_data: "data/test.csv"

# Data preprocessing configuration
preprocessing:
  normalize_features: True
  normalize_labels: True
  feature_scaler: "standard"  # standard, minmax, robust
  label_scaler: "standard"

# Data augmentation configuration
augmentation:
  enabled: False
  noise_std: 0.01
  augmentation_factor: 2

# Synthetic data configuration
synthetic_data:
  num_samples: 1000
  seed: 42
  add_noise: True
  add_correlations: True
```
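
The `augmentation` block is not explained further in this tutorial. One plausible reading, shown here purely as an assumption, is that `augmentation_factor` is the total size multiplier and that each extra copy of a row gets Gaussian noise with standard deviation `noise_std`. The function name is hypothetical.

```python
import random
from typing import List, Sequence


def augment_with_noise(rows: Sequence[Sequence[float]],
                       noise_std: float = 0.01,
                       augmentation_factor: int = 2,
                       seed: int = 42) -> List[List[float]]:
    """Return the original rows followed by noisy copies.

    Assumption: the result holds `augmentation_factor` times as many rows as the input.
    """
    rng = random.Random(seed)
    augmented = [list(row) for row in rows]
    for _ in range(augmentation_factor - 1):
        for row in rows:
            augmented.append([v + rng.gauss(0.0, noise_std) for v in row])
    return augmented


train_rows = [
    [0.5, 0.3, -0.2, 80.0, 0.1, 0.4, -0.1],
    [-0.3, 0.6, 0.2, 45.0, -0.1, 0.7, 0.1],
]
augmented = augment_with_noise(train_rows)
print(len(augmented))  # 4
```

This sketch does not clip the noisy values, so a feature that sits exactly on a range boundary can end up slightly outside it.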

## Command-Line Tools

### Training Command

```bash
# Basic training
python -m src.cli.main train --config configs/training_config.yaml

# Specify the output directory
python -m src.cli.main train --config configs/training_config.yaml --output-dir ./my_models

# Train on a GPU
python -m src.cli.main train --config configs/training_config.yaml --device cuda

# Resume training from a checkpoint
python -m src.cli.main train --config configs/training_config.yaml --resume checkpoints/epoch_50.pth

# Override configuration parameters
python -m src.cli.main train --config configs/training_config.yaml --epochs 200 --batch-size 64 --learning-rate 0.0005
```
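
The last command above overrides values from the configuration file on the command line. The project's CLI code is not shown here, so the following is only a sketch of how such overrides are commonly implemented with `argparse`: flags default to `None`, and only flags that were actually passed replace the loaded values. `build_parser` and `apply_overrides` are hypothetical names.

```python
import argparse
from typing import Any, Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Parser with the three override flags used in the command above."""
    parser = argparse.ArgumentParser(prog="train")
    parser.add_argument("--epochs", type=int, default=None)
    parser.add_argument("--batch-size", type=int, default=None)
    parser.add_argument("--learning-rate", type=float, default=None)
    return parser


def apply_overrides(config: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Return a copy of `config` with any flags given in `argv` applied."""
    args = build_parser().parse_args(argv)
    training = dict(config.get("training", {}))
    for key in ("epochs", "batch_size", "learning_rate"):
        value = getattr(args, key)  # argparse maps --batch-size to batch_size
        if value is not None:
            training[key] = value
    return {**config, "training": training}


config = {"training": {"epochs": 100, "learning_rate": 0.001, "batch_size": 32}}
merged = apply_overrides(
    config, ["--epochs", "200", "--batch-size", "64", "--learning-rate", "0.0005"]
)
print(merged["training"])  # {'epochs': 200, 'learning_rate': 0.0005, 'batch_size': 64}
```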

### Prediction Command

```bash
# Interactive prediction
python -m src.cli.main predict --model model.pth --interactive

# Quick prediction
python -m src.cli.main predict --model model.pth --quick 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1

# Batch prediction
python -m src.cli.main predict --model model.pth --batch input.csv --output results.csv

# Specify the preprocessor
python -m src.cli.main predict --model model.pth --preprocessor preprocessor.pkl --quick 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1
```

### Evaluation Command

```bash
# Basic evaluation
python -m src.cli.main evaluate --model model.pth --data test_data.csv

# Generate a detailed report
python -m src.cli.main evaluate --model model.pth --data test_data.csv --report evaluation_report.html

# Specify the evaluation metrics
python -m src.cli.main evaluate --model model.pth --data test_data.csv --metrics mse mae r2

# Use a custom batch size
python -m src.cli.main evaluate --model model.pth --data test_data.csv --batch-size 64
```

### Inference Command

```bash
# Inference from command-line input
python -m src.cli.main inference --model model.pth --input-cli 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1

# Inference from a JSON file
python -m src.cli.main inference --model model.pth --input-json input.json --output-json output.json

# Inference from a CSV file
python -m src.cli.main inference --model model.pth --input-csv input.csv --output-csv output.csv

# Benchmark
python -m src.cli.main inference --model model.pth --benchmark --num-samples 1000

# Quiet mode
python -m src.cli.main inference --model model.pth --input-cli 0.5 0.3 -0.2 75.0 0.1 0.4 -0.1 --quiet
```

### Benchmark Command

```bash
# Standard benchmark
python -m src.cli.main benchmark --model model.pth

# Custom benchmark parameters
python -m src.cli.main benchmark --model model.pth --num-samples 5000 --batch-size 64

# Generate a performance report
python -m src.cli.main benchmark --model model.pth --report performance_report.json

# Verbose output
python -m src.cli.main benchmark --model model.pth --verbose
```

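## FAQ

### Q1: How do I handle missing values?
A: The project does not currently handle missing values. Clean them during data preprocessing with one of the following approaches:
```python
# Option 1: drop rows that contain missing values
df = df.dropna()

# Option 2: fill missing values (choose one strategy)
df = df.fillna(df.mean())  # fill with the column mean
df = df.fillna(0)          # fill with 0
```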

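### Q2: How do I customize the model architecture?
A: There are two ways to customize the model architecture:

**Option 1: Edit the configuration file**
```yaml
# Edit in model_config.yaml
architecture:
  hidden_layers:
    - size: 256  # more neurons in this layer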
      activation: "ReLU"
      dropout: 0.3
    - size: 128
      activation: "ReLU"
      dropout: 0.2
    - size: 64
      activation: "ReLU"
      dropout: 0.1
```

**Option 2: Create the model directly**
```python
from src.models.pad_predictor import PADPredictor

model = PADPredictor(
    input_dim=7,
    output_dim=3,
    hidden_dims=[256, 128, 64, 32],  # custom hidden layers
    dropout_rate=0.3
)
```

### Q3: How do I handle categorical features?
A: The current version supports numeric features only. Categorical features must be encoded first:
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['category_column'])

# Or label encoding
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category_column'])
```

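### Q4: How do I improve model performance?
A: Try the following:
1. **Add more training data**
2. **Adjust the model architecture** (more layers or more neurons)
3. **Tune hyperparameters** (learning rate, batch size, and so on)
4. **Augment the data** (add noise or synthetic samples)
5. **Regularize** (adjust dropout and weight_decay)
6. **Use early stopping** (to prevent overfitting)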

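### Q5: How do I deploy the model to production?
A: The recommended approach is:
1. **Save the model and the preprocessor**
2. **Create an inference service**
3. **Wrap it with FastAPI or Flask**
4. **Deploy it in a container**

Example:
```python
from typing import List

from fastapi import FastAPI
from src.utils.inference_engine import create_inference_engine

app = FastAPI()
engine = create_inference_engine("model.pth", "preprocessor.pkl")

@app.post("/predict")
async def predict(input_data: List[float]):
    # The request body is a JSON array holding the 7 input features
    result = engine.predict(input_data)
    return result
```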

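### Q6: How do I handle large-scale data?
A: For large datasets:
1. **Use a data generator** (DataGenerator)
2. **Process the data in batches**
3. **Load data with multiple worker processes**
4. **Consider distributed training**

```python
from torch.utils.data import DataLoader

# Load data with multiple worker processes
train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,   # worker processes
    pin_memory=True  # page-locked host memory for faster transfer to the GPU
)
```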
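
The first two items in the list above, generating data lazily and processing it in batches, can be illustrated with a plain Python generator. This is a sketch of the general idea rather than the project's `DataGenerator`; `iter_batches` is a hypothetical helper that reads a numeric CSV one batch at a time so the whole file never has to fit in memory.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, TextIO


def iter_batches(handle: TextIO, batch_size: int = 32) -> Iterator[List[List[float]]]:
    """Yield lists of up to `batch_size` numeric rows from a CSV with a header line."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row, if any
    while True:
        batch = [[float(v) for v in row] for row in islice(reader, batch_size)]
        if not batch:
            return
        yield batch


# In-memory stand-in for a large file: a header plus 5 data rows
demo = io.StringIO("a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(5)))
print([len(batch) for batch in iter_batches(demo, batch_size=2)])  # [2, 2, 1]
```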

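### Q7: How do I visualize prediction results?
A: Results can be visualized in several ways, for example with Matplotlib:
```python
import matplotlib.pyplot as plt

# Scatter plot of predictions vs. true values
# (true_labels and predictions are assumed to be NumPy arrays of the same shape)
plt.scatter(true_labels, predictions, alpha=0.6)
# Diagonal reference line spanning the range of both arrays
min_val = min(true_labels.min(), predictions.min())
max_val = max(true_labels.max(), predictions.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--')
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.show()

# Residual histogram
residuals = (true_labels - predictions).ravel()
plt.hist(residuals, bins=30)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
```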

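### Q8: How do I manage model versions?
A: A suggested versioning strategy:
1. **Use semantic version numbers**
2. **Save the training configuration and hyperparameters**
3. **Record model performance metrics**
4. **Use a model registry**

```python
from datetime import datetime

import torch

# Save model metadata (`model`, `config`, and `metrics` come from training and evaluation)
model_info = {
    'version': '1.2.0',
    'architecture': str(model),
    'training_config': config,
    'performance': metrics,
    'created_at': datetime.now().isoformat()
}

torch.save({
    'model_state_dict': model.state_dict(),
    'model_info': model_info
}, f'model_v{model_info["version"]}.pth')
```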

## Troubleshooting

### Common Errors and Solutions

#### 1. CUDA out of memory
```
RuntimeError: CUDA out of memory
```
**Solutions**:
- Reduce the batch size
- Train on the CPU: `--device cpu`
- Clear the GPU cache: `torch.cuda.empty_cache()`

#### 2. Model fails to load
```
FileNotFoundError: [Errno 2] No such file or directory: 'model.pth'
```
**Solutions**:
- Check that the file path is correct
- Make sure the model file exists
- Check the file permissions

#### 3. Data dimension mismatch
```
RuntimeError: mat1 and mat2 shapes cannot be multiplied
```
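**Solutions**:
- Check the input dimension (it should be 7)
- Make sure the data was preprocessed correctly
- Verify the model configuration

#### 4. Import errors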
```
ModuleNotFoundError: No module named 'src.xxx'
```
**Solutions**:
- Check the Python path settings
- Make sure you run commands from the project root
- Reinstall the dependencies

#### 5. Configuration file errors
```
yaml.scanner.ScannerError: while scanning for the next token
```
**Solutions**:
- Check the YAML syntax
- Make sure the indentation is correct
- Use a YAML validator

### Debugging Tips

#### 1. Enable verbose logging
```bash
python -m src.cli.main train --config configs/training_config.yaml --verbose --log-level DEBUG
```

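#### 2. Test with a small dataset
```python
from src.data.synthetic_generator import SyntheticDataGenerator

# Use a small amount of data for a quick test
generator = SyntheticDataGenerator(num_samples=100, seed=42)
features, labels = generator.generate_data()
```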

#### 3. Inspect the model output
```python
import torch

# Check the shape and range of the model output
model.eval()
with torch.no_grad():
    sample_input = torch.randn(1, 7)
    output = model(sample_input)
    print(f"Output shape: {output.shape}")
    print(f"Output range: [{output.min():.3f}, {output.max():.3f}]")
```

#### 4. Verify data preprocessing
```python
# Inspect the preprocessed data
print(f"Features mean: {processed_features.mean(axis=0)}")
print(f"Features std: {processed_features.std(axis=0)}")
print(f"Labels mean: {processed_labels.mean(axis=0)}")
print(f"Labels std: {processed_labels.std(axis=0)}")
```

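### Performance Optimization Suggestions

#### 1. Training optimization
- Use an appropriate batch size
- Enable automatic mixed precision (AMP) training
- Use a learning rate scheduler
- Apply early stopping

#### 2. Inference optimization
- Warm up the model
- Run inference in batches
- Quantize the model
- Use the ONNX format

#### 3. Memory optimization
- Use a data generator
- Release variables that are no longer needed
- Use gradient accumulation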

---

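## Contact

If you have other questions or need help, reach out through:
- Project repository: [GitHub repository link]
- Issue tracker: [Issues link]
- Documentation: [documentation link]
- Email: [contact email]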

---

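**Note**: This tutorial reflects the current version of the project. Some details may change as the project evolves, so check the latest documentation.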