docs/CONFIGURATION.md · Corolin/Chordia at main

File size: 28,292 Bytes

0a6452f


# 配置文件说明文档

本文档详细介绍了情绪与生理状态变化预测模型的所有配置选项、参数说明和使用示例。

## 目录

1. [配置系统概述](#配置系统概述)
2. [模型配置](#模型配置)
3. [训练配置](#训练配置)
4. [数据配置](#数据配置)
5. [推理配置](#推理配置)
6. [日志配置](#日志配置)
7. [硬件配置](#硬件配置)
8. [实验跟踪配置](#实验跟踪配置)
9. [配置最佳实践](#配置最佳实践)
10. [配置验证](#配置验证)

## 配置系统概述

### 配置文件格式

项目使用YAML格式的配置文件，支持：
- 层次化结构
- 注释支持
- 变量引用
- 环境变量替换
- 配置继承

### 配置文件加载顺序

1. 默认配置 (内置)
2. 全局配置文件 (`~/.emotion-prediction/config.yaml`)
3. 项目配置文件 (`configs/`)
4. 命令行参数覆盖

### 配置管理器

```python
from src.utils.config import ConfigManager

# 加载配置
config_manager = ConfigManager()
config = config_manager.load_config("configs/training_config.yaml")

# 访问配置
learning_rate = config.training.optimizer.learning_rate
batch_size = config.training.batch_size

# 配置验证
config_manager.validate_config(config)
```

## 模型配置

### 主配置文件: `configs/model_config.yaml`

```yaml
# ========================================
# 模型配置文件
# ========================================

# 模型基本信息
model_info:
  name: "MLP_Emotion_Predictor"
  type: "MLP"
  version: "1.0"
  description: "基于MLP的情绪与生理状态变化预测模型"
  author: "Research Team"
  
# 输入输出维度配置
dimensions:
  input_dim: 7    # 输入维度：User PAD 3维 + Vitality 1维 + Current PAD 3维
  output_dim: 3   # 输出维度：ΔPAD 3维（ΔPleasure, ΔArousal, ΔDominance）
  
# 网络架构配置
architecture:
  # 隐藏层配置
  hidden_layers:
    - size: 128
      activation: "ReLU"
      dropout: 0.2
      batch_norm: false
      layer_norm: false
    - size: 64
      activation: "ReLU"
      dropout: 0.2
      batch_norm: false
      layer_norm: false
    - size: 32
      activation: "ReLU"
      dropout: 0.1
      batch_norm: false
      layer_norm: false
  
  # 输出层配置
  output_layer:
    activation: "Linear"  # 线性激活，用于回归任务
    
  # 正则化配置
  use_batch_norm: false
  use_layer_norm: false
  
# 权重初始化配置
initialization:
  weight_init: "xavier_uniform"  # 可选: xavier_uniform, xavier_normal, kaiming_uniform, kaiming_normal
  bias_init: "zeros"             # 可选: zeros, ones, uniform, normal
  
# 正则化配置
regularization:
  # L2正则化
  weight_decay: 0.0001
  
  # Dropout配置
  dropout_config:
    type: "standard"      # 标准 dropout
    rate: 0.2            # Dropout 概率
    
  # 批归一化
  batch_norm_config:
    momentum: 0.1
    eps: 1e-5
    
# 模型保存配置
model_saving:
  save_best_only: true          # 只保存最佳模型
  save_format: "pytorch"        # 保存格式: pytorch, onnx, torchscript
  checkpoint_interval: 10       # 每10个epoch保存一次检查点
  max_checkpoints: 5           # 最多保存5个检查点
  
# PAD情绪空间特殊配置
emotion_model:
  # PAD值的范围限制
  pad_space:
    pleasure_range: [-1.0, 1.0]     # 快乐维度范围
    arousal_range: [-1.0, 1.0]      # 激活度维度范围  
    dominance_range: [-1.0, 1.0]    # 支配度维度范围
    
  # 生理指标配置
  vitality:
    range: [0.0, 100.0]             # 活力值范围
    normalization: "min_max"        # 标准化方法: min_max, z_score, robust
    
  # 预测输出配置
  prediction:
    # ΔPAD的变化范围限制
    delta_pad_range: [-0.5, 0.5]    # PAD变化的合理范围
    # 压力值变化范围
    delta_pressure_range: [-0.3, 0.3]
    # 置信度范围
    confidence_range: [0.0, 1.0]
```

### 模型配置参数详解

#### `model_info` 模型基本信息

| 参数 | 类型 | 必需 | 默认值 | 说明 |
|------|------|------|--------|------|
| `name` | str | 是 | - | 模型名称 |
| `type` | str | 是 | - | 模型类型 (MLP, CNN, RNN等) |
| `version` | str | 是 | - | 模型版本号 |
| `description` | str | 否 | - | 模型描述 |
| `author` | str | 否 | - | 作者信息 |

#### `dimensions` 输入输出维度

| 参数 | 类型 | 必需 | 默认值 | 说明 |
|------|------|------|--------|------|
| `input_dim` | int | 是 | 7 | 输入特征维度 |
| `output_dim` | int | 是 | 3 | 输出预测维度（ΔPAD 3维） |

#### `architecture` 网络架构

##### `hidden_layers` 隐藏层配置

每个隐藏层支持以下参数：

| 参数 | 类型 | 必需 | 默认值 | 说明 |
|------|------|------|--------|------|
| `size` | int | 是 | - | 神经元数量 |
| `activation` | str | 否 | ReLU | 激活函数 |
| `dropout` | float | 否 | 0.0 | Dropout概率 |
| `batch_norm` | bool | 否 | false | 是否使用批归一化 |
| `layer_norm` | bool | 否 | false | 是否使用层归一化 |

**激活函数选项**:
- `ReLU`: 修正线性单元
- `LeakyReLU`: 泄漏ReLU
- `Tanh`: 双曲正切
- `Sigmoid`: Sigmoid函数
- `GELU`: 高斯误差线性单元
- `Swish`: Swish激活函数

##### `output_layer` 输出层配置

| 参数 | 类型 | 必需 | 默认值 | 说明 |
|------|------|------|--------|------|
| `activation` | str | 否 | Linear | 输出激活函数 |

#### `initialization` 权重初始化

| 参数 | 类型 | 必需 | 默认值 | 说明 |
|------|------|------|--------|------|
| `weight_init` | str | 否 | xavier_uniform | 权重初始化方法 |
| `bias_init` | str | 否 | zeros | 偏置初始化方法 |

**权重初始化选项**:
- `xavier_uniform`: Xavier均匀初始化
- `xavier_normal`: Xavier正态初始化
- `kaiming_uniform`: Kaiming均匀初始化 (适合ReLU)
- `kaiming_normal`: Kaiming正态初始化 (适合ReLU)
- `uniform`: 均匀分布初始化
- `normal`: 正态分布初始化

## 训练配置

### 主配置文件: `configs/training_config.yaml`

```yaml
# ========================================
# 训练配置文件
# ========================================

# 训练基本信息
training_info:
  experiment_name: "emotion_prediction_v1"
  description: "基于MLP的情绪与生理状态变化预测模型训练"
  seed: 42
  tags: ["baseline", "mlp", "emotion_prediction"]
  
# 数据配置
data:
  # 数据路径
  paths:
    train_data: "data/train.csv"
    val_data: "data/val.csv"
    test_data: "data/test.csv"
    
  # 数据预处理
  preprocessing:
    # 特征标准化
    feature_scaling:
      method: "standard"        # standard, min_max, robust, none
      pad_features: "standard"  # PAD特征标准化方法
      vitality_feature: "min_max" # 活力值标准化方法
      
    # 标签标准化
    label_scaling:
      method: "standard"
      delta_pad: "standard"
      delta_pressure: "standard"
      confidence: "none"
    
    # 数据增强
    augmentation:
      enabled: false
      noise_std: 0.01
      mixup_alpha: 0.2
      augmentation_factor: 2
    
    # 数据验证
    validation:
      check_ranges: true
      check_missing: true
      check_outliers: true
      outlier_method: "iqr"  # iqr, zscore, isolation_forest
  
  # 数据加载器配置
  dataloader:
    batch_size: 32
    num_workers: 4
    pin_memory: true
    shuffle: true
    drop_last: false
    persistent_workers: true
    
  # 数据分割
  split:
    train_ratio: 0.8
    val_ratio: 0.1
    test_ratio: 0.1
    stratify: false
    random_seed: 42

# 训练超参数
training:
  # 训练轮次
  epochs:
    max_epochs: 200
    warmup_epochs: 5
    
  # 早停配置
  early_stopping:
    enabled: true
    patience: 15          # 监控轮数
    min_delta: 1e-4       # 最小改善
    monitor: "val_loss"   # 监控指标
    mode: "min"           # min/max
    restore_best_weights: true
    
  # 梯度配置
  gradient:
    clip_enabled: true
    clip_value: 1.0
    clip_norm: 2          # 1: L1 norm, 2: L2 norm
    
  # 混合精度训练
  mixed_precision:
    enabled: false
    opt_level: "O1"       # O0, O1, O2, O3
    
  # 梯度累积
  gradient_accumulation:
    enabled: false
    accumulation_steps: 4

# 优化器配置
optimizer:
  type: "AdamW"          # Adam, SGD, AdamW, RMSprop, Adagrad
  
  # Adam/AdamW 参数
  adam_config:
    lr: 0.0005           # 学习率
    weight_decay: 0.01   # 权重衰减
    betas: [0.9, 0.999]  # Beta参数
    eps: 1e-8           # 数值稳定性
    amsgrad: false       # AMSGrad变体
    
  # SGD 参数
  sgd_config:
    lr: 0.01
    momentum: 0.9
    weight_decay: 0.0001
    nesterov: true
    
  # RMSprop 参数
  rmsprop_config:
    lr: 0.001
    alpha: 0.99
    weight_decay: 0.0
    momentum: 0.0

# 学习率调度器配置
scheduler:
  type: "CosineAnnealingLR"  # StepLR, CosineAnnealingLR, ReduceLROnPlateau, ExponentialLR
  
  # 余弦退火调度器
  cosine_config:
    T_max: 200              # 最大轮数
    eta_min: 1e-6          # 最小学习率
    last_epoch: -1
    
  # 步长调度器
  step_config:
    step_size: 30           # 步长
    gamma: 0.1             # 衰减因子
    
  # 平台衰减调度器
  plateau_config:
    patience: 10           # 耐心值
    factor: 0.5           # 衰减因子
    min_lr: 1e-7          # 最小学习率
    threshold: 1e-4       # 改善阈值
    verbose: true

# 损失函数配置
loss:
  type: "WeightedMSELoss"  # MSELoss, L1Loss, SmoothL1Loss, HuberLoss, WeightedMSELoss
  
  # 基础损失参数
  base_config:
    reduction: "mean"      # mean, sum, none
    
  # 加权损失配置
  weighted_config:
    delta_pad_weight: 1.0      # ΔPAD预测权重
    delta_pressure_weight: 1.0 # ΔPressure预测权重  
    confidence_weight: 0.5     # 置信度预测权重
    
  # Huber损失配置
  huber_config:
    delta: 1.0             # Huber阈值
    
  # 焦点损失配置 (可选)
  focal_config:
    alpha: 1.0
    gamma: 2.0

# 验证配置
validation:
  # 验证频率
  val_frequency: 1        # 每多少个epoch验证一次
  
  # 验证指标
  metrics:
    - "MSE"
    - "MAE"
    - "RMSE"
    - "R2"
    - "MAPE"
    
  # 模型选择
  model_selection:
    criterion: "val_loss"  # val_loss, val_mae, val_r2
    mode: "min"           # min/max
    
  # 验证数据增强
  val_augmentation:
    enabled: false
    methods: []

# 日志和监控配置
logging:
  # 日志级别
  level: "INFO"           # DEBUG, INFO, WARNING, ERROR
  
  # 日志文件
  log_dir: "logs"
  log_file: "training.log"
  max_file_size: "10MB"
  backup_count: 5
  
  # TensorBoard
  tensorboard:
    enabled: true
    log_dir: "runs"
    comment: ""
    flush_secs: 10
    
  # Wandb
  wandb:
    enabled: false
    project: "emotion-prediction"
    entity: "your-team"
    tags: []
    notes: ""
    
  # 进度条
  progress_bar:
    enabled: true
    update_frequency: 10   # 更新频率
    leave: true           # 训练完成后是否保留

# 检查点保存配置
checkpointing:
  # 保存目录
  save_dir: "checkpoints"
  
  # 保存策略
  save_strategy: "best"   # best, last, all
  
  # 文件命名
  filename_template: "model_epoch_{epoch}_val_{val_loss:.4f}.pth"
  
  # 保存内容
  save_items:
    - "model_state_dict"
    - "optimizer_state_dict"
    - "scheduler_state_dict"
    - "epoch"
    - "loss"
    - "metrics"
    - "config"
    
  # 保存频率
  save_frequency: 1       # 每多少个epoch保存一次
  
  # 最大检查点数量
  max_checkpoints: 5

# 硬件配置
hardware:
  # 设备选择
  device: "auto"          # auto, cpu, cuda, mps
  
  # GPU配置
  gpu:
    id: 0                 # GPU ID
    memory_fraction: 0.9  # GPU内存使用比例
    allow_growth: true    # 动态内存增长
    
  # 混合精度
  mixed_precision:
    enabled: false
    opt_level: "O1"
    
  # 分布式训练
  distributed:
    enabled: false
    backend: "nccl"
    init_method: "env://"
    world_size: 1
    rank: 0

# 调试配置
debug:
  # 调试模式
  enabled: false
  
  # 快速训练
  fast_train:
    enabled: false
    max_epochs: 5
    batch_size: 8
    subset_size: 100
    
  # 梯度检查
  gradient_checking:
    enabled: false
    clip_value: 1.0
    check_nan: true
    check_inf: true
    
  # 数据检查
  data_checking:
    enabled: true
    check_nan: true
    check_inf: true
    check_range: true
    sample_output: true
    
  # 模型检查
  model_checking:
    enabled: false
    count_parameters: true
    check_gradients: true
    visualize_model: false

# 实验跟踪配置
experiment_tracking:
  # 是否启用实验跟踪
  enabled: false
  
  # MLflow配置
  mlflow:
    tracking_uri: "http://localhost:5000"
    experiment_name: "emotion_prediction"
    run_name: null
    tags: {}
    params: {}
    
  # Weights & Biases配置
  wandb:
    project: "emotion-prediction"
    entity: null
    group: null
    job_type: "training"
    tags: []
    notes: ""
    config: {}
    
  # 本地实验跟踪
  local:
    save_dir: "experiments"
    save_config: true
    save_metrics: true
    save_model: true
```

### 训练配置参数详解

#### `training.epochs` 训练轮次

| 参数 | 类型 | 必需 | 默认值 | 说明 |
|------|------|------|--------|------|
| `max_epochs` | int | 是 | 200 | 最大训练轮数 |
| `warmup_epochs` | int | 否 | 0 | 预热轮数 |

#### `training.early_stopping` 早停配置

| 参数 | 类型 | 必需 | 默认值 | 说明 |
|------|------|------|--------|------|
| `enabled` | bool | 否 | true | 是否启用早停 |
| `patience` | int | 否 | 10 | 耐心值（轮数） |
| `min_delta` | float | 否 | 1e-4 | 最小改善阈值 |
| `monitor` | str | 否 | val_loss | 监控指标 |
| `mode` | str | 否 | min | 监控模式 (min/max) |
| `restore_best_weights` | bool | 否 | true | 恢复最佳权重 |

#### `optimizer` 优化器配置

支持的优化器类型：
- `Adam`: 自适应矩估计
- `AdamW`: Adam with Weight Decay
- `SGD`: 随机梯度下降
- `RMSprop`: RMSprop优化器
- `Adagrad`: 自适应梯度算法

#### `scheduler` 学习率调度器

支持的调度器类型：
- `StepLR`: 步长衰减
- `CosineAnnealingLR`: 余弦退火
- `ReduceLROnPlateau`: 平台衰减
- `ExponentialLR`: 指数衰减

## 数据配置

### 数据配置文件: `configs/data_config.yaml`

```yaml
# ========================================
# 数据配置文件
# ========================================

# 数据路径配置
paths:
  # 训练数据
  train_data: "data/train.csv"
  val_data: "data/val.csv"
  test_data: "data/test.csv"
  
  # 预处理器
  preprocessor: "models/preprocessor.pkl"
  
  # 数据统计
  statistics: "data/statistics.json"
  
  # 数据质量报告
  quality_report: "reports/data_quality.html"

# 数据源配置
data_source:
  type: "csv"              # csv, json, parquet, hdf5, database
  
  # CSV配置
  csv_config:
    delimiter: ","
    encoding: "utf-8"
    header: 0
    index_col: null
    
  # JSON配置
  json_config:
    orient: "records"      # records, index, values, columns
    lines: false
    
  # 数据库配置
  database_config:
    connection_string: "sqlite:///data.db"
    table: "emotion_data"
    query: null

# 数据预处理配置
preprocessing:
  # 特征处理
  features:
    # 缺失值处理
    missing_values:
      strategy: "drop"     # drop, fill_mean, fill_median, fill_mode, fill_constant
      fill_value: 0.0
      
    # 异常值处理
    outliers:
      method: "iqr"        # iqr, zscore, isolation_forest, none
      threshold: 1.5
      action: "clip"       # clip, remove, flag
      
    # 特征缩放
    scaling:
      method: "standard"   # standard, minmax, robust, none
      feature_range: [-1, 1]  # MinMax缩放范围
      
    # 特征选择
    selection:
      enabled: false
      method: "correlation" # correlation, mutual_info, rfe
      k_best: 10
      
  # 标签处理
  labels:
    # 缺失值处理
    missing_values:
      strategy: "fill_mean"
      
    # 标签缩放
    scaling:
      method: "standard"
      
    # 标签变换
    transformation:
      enabled: false
      method: "log"        # log, sqrt, boxcox

# 数据增强配置
augmentation:
  enabled: false
  
  # 噪声注入
  noise_injection:
    enabled: true
    noise_type: "gaussian" # gaussian, uniform
    noise_std: 0.01
    feature_wise: true
    
  # Mixup增强
  mixup:
    enabled: true
    alpha: 0.2
    
  # SMOTE增强 (用于不平衡数据)
  smote:
    enabled: false
    k_neighbors: 5
    sampling_strategy: "auto"

# 数据验证配置
validation:
  # 数值范围检查
  range_validation:
    enabled: true
    features:
      user_pleasure: [-1.0, 1.0]
      user_arousal: [-1.0, 1.0]
      user_dominance: [-1.0, 1.0]
      vitality: [0.0, 100.0]
      current_pleasure: [-1.0, 1.0]
      current_arousal: [-1.0, 1.0]
      current_dominance: [-1.0, 1.0]
    labels:
      delta_pleasure: [-0.5, 0.5]
      delta_arousal: [-0.5, 0.5]
      delta_dominance: [-0.5, 0.5]
      delta_pressure: [-0.3, 0.3]
      confidence: [0.0, 1.0]
      
  # 数据质量检查
  quality_checks:
    check_duplicates: true
    check_missing: true
    check_outliers: true
    check_correlations: true
    check_distribution: true
    
  # 统计报告
  statistics:
    compute_descriptive: true
    compute_correlations: true
    compute_distributions: true
    save_plots: true

# 合成数据配置
synthetic_data:
  enabled: false
  
  # 生成参数
  generation:
    num_samples: 1000
    seed: 42
    
  # 数据分布
  distribution:
    type: "multivariate_normal"  # normal, uniform, multivariate_normal
    mean: null
    cov: null
    
  # 相关性配置
  correlation:
    enabled: true
    strength: 0.5
    structure: "block"  # block, random, toeplitz
    
  # 噪声配置
  noise:
    add_noise: true
    noise_type: "gaussian"
    noise_std: 0.1
```

## 推理配置

### 推理配置文件: `configs/inference_config.yaml`

```yaml
# ========================================
# 推理配置文件
# ========================================

# 推理基本信息
inference_info:
  model_path: "models/best_model.pth"
  preprocessor_path: "models/preprocessor.pkl"
  device: "auto"
  batch_size: 32
  
# 输入配置
input:
  # 输入格式
  format: "auto"          # auto, list, numpy, pandas, json, csv
  
  # 输入验证
  validation:
    enabled: true
    check_shape: true
    check_range: true
    check_type: true
    
  # 输入预处理
  preprocessing:
    normalize: true
    handle_missing: "error"  # error, fill, skip
    missing_value: 0.0

# 输出配置
output:
  # 输出格式
  format: "dict"          # dict, json, csv, numpy
  
  # 输出内容
  include:
    predictions: true
    confidence: true
    components: true       # delta_pad, delta_pressure, confidence
    metadata: false        # inference_time, model_info
    
  # 输出后处理
  postprocessing:
    clip_predictions: true
    round_decimals: 6
    format_confidence: "percentage"  # decimal, percentage

# 性能优化配置
optimization:
  # 模型优化
  model_optimization:
    enabled: true
    torch_script: false
    onnx: false
    quantization: false
    
  # 推理优化
  inference_optimization:
    warmup: true
    warmup_samples: 5
    batch_optimization: true
    memory_optimization: true
    
  # 缓存配置
  caching:
    enabled: false
    cache_size: 1000
    cache_policy: "lru"   # lru, fifo

# 监控配置
monitoring:
  # 性能监控
  performance:
    enabled: true
    track_latency: true
    track_memory: true
    track_throughput: true
    
  # 质量监控
  quality:
    enabled: false
    confidence_threshold: 0.5
    prediction_validation: true
    
  # 异常检测
  anomaly_detection:
    enabled: false
    method: "statistical"  # statistical, isolation_forest
    threshold: 2.0

# 服务配置 (用于部署)
service:
  # API配置
  api:
    host: "0.0.0.0"
    port: 8000
    workers: 1
    timeout: 30
    
  # 限流配置
  rate_limiting:
    enabled: false
    requests_per_minute: 100
    
  # 认证配置
  authentication:
    enabled: false
    method: "api_key"     # api_key, jwt, basic
    
  # 日志配置
  logging:
    level: "INFO"
    format: "json"
```

## 日志配置

### 日志配置文件: `configs/logging_config.yaml`

```yaml
# ========================================
# 日志配置文件
# ========================================

# 日志系统配置
logging:
  # 根日志器
  root:
    level: "INFO"
    handlers: ["console", "file"]
    
  # 日志器配置
  loggers:
    training:
      level: "INFO"
      handlers: ["console", "file", "tensorboard"]
      propagate: false
      
    inference:
      level: "WARNING"
      handlers: ["console", "file"]
      propagate: false
      
    data:
      level: "DEBUG"
      handlers: ["file"]
      propagate: false

# 处理器配置
handlers:
  # 控制台处理器
  console:
    class: "StreamHandler"
    level: "INFO"
    formatter: "console"
    stream: "ext://sys.stdout"
    
  # 文件处理器
  file:
    class: "RotatingFileHandler"
    level: "DEBUG"
    formatter: "detailed"
    filename: "logs/app.log"
    maxBytes: 10485760     # 10MB
    backupCount: 5
    encoding: "utf8"
    
  # 错误文件处理器
  error_file:
    class: "RotatingFileHandler"
    level: "ERROR"
    formatter: "detailed"
    filename: "logs/error.log"
    maxBytes: 10485760
    backupCount: 3
    encoding: "utf8"

# 格式化器配置
formatters:
  # 控制台格式
  console:
    format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    datefmt: "%H:%M:%S"
    
  # 详细格式
  detailed:
    format: "%(asctime)s - %(name)s - %(levelname)s - %(module)s:%(lineno)d - %(funcName)s - %(message)s"
    datefmt: "%Y-%m-%d %H:%M:%S"
    
  # JSON格式
  json:
    format: '{"timestamp": "%(asctime)s", "level": "%(levelname)s", "logger": "%(name)s", "module": "%(module)s", "line": %(lineno)d, "message": "%(message)s"}'
    datefmt: "%Y-%m-%dT%H:%M:%S"

# 日志过滤配置
filters:
  # 性能过滤器
  performance:
    class: "PerformanceFilter"
    threshold: 0.1
    
  # 敏感信息过滤器
  sensitive:
    class: "SensitiveDataFilter"
    patterns: ["password", "token", "key"]
```

## 硬件配置

### 硬件配置文件: `configs/hardware_config.yaml`

```yaml
# ========================================
# 硬件配置文件
# ========================================

# 设备配置
device:
  # 自动选择
  auto:
    priority: ["cuda", "mps", "cpu"]  # 设备优先级
    memory_threshold: 0.8              # 内存使用阈值
    
  # CPU配置
  cpu:
    num_threads: null                  # null为自动检测
    use_openmp: true
    use_mkl: true
    
  # GPU配置
  gpu:
    # GPU选择
    device_id: 0                       # GPU ID
    memory_fraction: 0.9               # GPU内存使用比例
    allow_growth: true                 # 动态内存增长
    
    # CUDA配置
    cuda:
      allow_tf32: true                 # 启用TF32
      benchmark: true                  # 启用cuDNN基准
      deterministic: false             # 确定性模式
      
    # 混合精度
    mixed_precision:
      enabled: false
      opt_level: "O1"                  # O0, O1, O2, O3
      loss_scale: "dynamic"            # static, dynamic
      
    # 多GPU配置
    multi_gpu:
      enabled: false
      device_ids: [0, 1]
      output_device: 0
      dim: 0                           # 数据并行维度

# 内存配置
memory:
  # 系统内存
  system:
    max_usage: 0.8                     # 最大使用比例
    cleanup_threshold: 0.9             # 清理阈值
    
  # GPU内存
  gpu:
    max_usage: 0.9
    cleanup_interval: 100              # 清理间隔（步数）
    
  # 内存优化
  optimization:
    enable_gc: true                    # 启用垃圾回收
    gc_threshold: 0.8                  # GC触发阈值
    pin_memory: true                   # 锁页内存
    share_memory: true                 # 共享内存

# 性能配置
performance:
  # 并行配置
  parallel:
    num_workers: 4                     # 数据加载器工作进程数
    prefetch_factor: 2                 # 预取因子
    
  # 缓存配置
  cache:
    model_cache: true                  # 模型缓存
    data_cache: true                   # 数据缓存
    cache_size: 1024                   # 缓存大小（MB）
    
  # 编译优化
  compilation:
    torch_compile: false               # PyTorch 2.0编译
    jit_script: true                   # TorchScript
    mode: "default"                    # default, reduce-overhead, max-autotune
```

## 配置最佳实践

### 1. 配置文件组织

```
configs/
├── model_config.yaml          # 模型配置
├── training_config.yaml       # 训练配置
├── data_config.yaml          # 数据配置
├── inference_config.yaml     # 推理配置
├── logging_config.yaml       # 日志配置
├── hardware_config.yaml      # 硬件配置
├── environments/             # 环境特定配置
│   ├── development.yaml
│   ├── staging.yaml
│   └── production.yaml
└── experiments/              # 实验特定配置
    ├── baseline.yaml
    ├── large_model.yaml
    └── fast_train.yaml
```

### 2. 配置继承

```yaml
# configs/experiments/large_model.yaml
_base_: "../training_config.yaml"

training:
  epochs:
    max_epochs: 500
    
model:
  architecture:
    hidden_layers:
      - size: 256
        activation: "ReLU"
        dropout: 0.3
      - size: 128
        activation: "ReLU"
        dropout: 0.2
      - size: 64
        activation: "ReLU"
        dropout: 0.1

experiment_tracking:
  enabled: true
  mlflow:
    experiment_name: "large_model_experiment"
```

### 3. 环境变量替换

```yaml
# 使用环境变量
model_path: "${MODEL_PATH:/models/default.pth}"
learning_rate: "${LEARNING_RATE:0.001}"
batch_size: "${BATCH_SIZE:32}"
```

### 4. 配置验证

```python
from src.utils.config import ConfigValidator
from src.utils.config import ValidationError

# 创建验证器
validator = ConfigValidator()

# 添加验证规则
validator.add_rule("training.optimizer.lr", lambda x: 0 < x <= 1)
validator.add_rule("model.hidden_dims", lambda x: len(x) > 0)

# 验证配置
try:
    validator.validate(config)
except ValidationError as e:
    print(f"配置验证失败: {e}")
```

### 5. 配置版本管理

```yaml
# 配置文件版本
config_version: "1.0"
compatibility_version: ">=0.9.0"

# 变更日志
changelog:
  - version: "1.0"
    changes: ["添加混合精度支持", "更新学习率调度器"]
  - version: "0.9"
    changes: ["初始版本"]
```

## 配置验证

### 配置验证器

```python
class ConfigValidator:
    """配置验证器"""
    
    def __init__(self):
        self.rules = {}
        self.schemas = {}
    
    def add_rule(self, path: str, validator: callable, message: str = None):
        """添加验证规则"""
        self.rules[path] = {
            'validator': validator,
            'message': message or f"Invalid value at {path}"
        }
    
    def add_schema(self, section: str, schema: Dict):
        """添加配置模式"""
        self.schemas[section] = schema
    
    def validate(self, config: Dict) -> bool:
        """验证配置"""
        for path, rule in self.rules.items():
            value = self._get_nested_value(config, path)
            if not rule['validator'](value):
                raise ValidationError(rule['message'])
        return True
    
    def _get_nested_value(self, config: Dict, path: str):
        """获取嵌套值"""
        keys = path.split('.')
        value = config
        for key in keys:
            value = value.get(key)
            if value is None:
                return None
        return value
```

### 常用验证规则

```python
# 数值范围验证
validator.add_rule("training.optimizer.lr", lambda x: 0 < x <= 1, "学习率必须在(0, 1]范围内")
validator.add_rule("model.dropout_rate", lambda x: 0 <= x < 1, "Dropout率必须在[0, 1)范围内")

# 列表验证
validator.add_rule("model.hidden_dims", lambda x: isinstance(x, list) and len(x) > 0