	# Tutorial 08: Continual Learning & Model Lifecycle Management

	## Overview

	This tutorial covers continual learning strategies for updating models with new data while preventing catastrophic forgetting, along with complete model lifecycle management including versioning, deployment, monitoring, and retirement. We'll integrate NTF's configuration system for hyperparameter tuning and use the `ContinualLearningWrapper` for EWC regularization.

	## Table of Contents

	1. [Continual Learning Fundamentals](#continual-learning-fundamentals)
	2. [Catastrophic Forgetting](#catastrophic-forgetting)
	3. [Replay-Based Methods](#replay-based-methods)
	4. [Regularization-Based Methods](#regularization-based-methods)
	5. [Architecture-Based Methods](#architecture-based-methods)
	6. [Incremental Training Strategies](#incremental-training-strategies)
	7. [Hyperparameter Tuning with NTF Config](#hyperparameter-tuning-with-ntf-config)
	8. [Model Versioning](#model-versioning)
	9. [Deployment Strategies](#deployment-strategies)
	10. [Production Monitoring](#production-monitoring)
	11. [Model Retirement and Archival](#model-retirement-and-archival)

	---

	## Continual Learning Fundamentals

	### What is Continual Learning?

	Continual learning (also called lifelong learning or incremental learning) enables models to:
	- Learn continuously from new data over time
	- Adapt to changing distributions and tasks
	- Accumulate knowledge without forgetting previous capabilities
	- Operate in non-stationary environments

	### Key Challenges

	1. Catastrophic Forgetting: Learning new information causes loss of old knowledge
	2. Stability-Plasticity Dilemma: Balance between retaining old knowledge and learning new patterns
	3. Task Boundary Detection: Knowing when a new task or distribution shift occurs
	4. Computational Efficiency: Avoiding full retraining on all historical data
	5. Memory Constraints: Storing representative examples without keeping everything

	### Continual Learning Scenarios

```python
from enum import Enum


class ContinualLearningScenario(Enum):
    """Different continual learning scenarios."""

    TASK_INCREMENTAL = "task_incremental"
    # New tasks arrive sequentially with clear boundaries
    # Example: First train on sentiment analysis, then on QA

    DOMAIN_INCREMENTAL = "domain_incremental"
    # Same task but domains change over time
    # Example: News articles from 2020, then 2021, then 2022

    CLASS_INCREMENTAL = "class_incremental"
    # New classes appear over time
    # Example: First classify cats/dogs, then add birds, then fish

    INSTANCE_INCREMENTAL = "instance_incremental"
    # Same task and classes, just more data arrives
    # Example: Continuous stream of customer support tickets
```

	### Evaluation Metrics for Continual Learning

```python
import numpy as np


class ContinualLearningMetrics:
    def __init__(self):
        self.task_accuracies = {}  # {task_id: {timestamp: accuracy}}

    def record_accuracy(self, task_id, timestamp, accuracy):
        """Record accuracy for a task at a specific time."""
        if task_id not in self.task_accuracies:
            self.task_accuracies[task_id] = {}
        self.task_accuracies[task_id][timestamp] = accuracy

    def calculate_forward_transfer(self):
        """
        Forward Transfer: How much does learning task A help with task B?

        Positive values indicate beneficial transfer.

        Simplified proxy: for every task except the first, compare its first
        recorded accuracy with its best recorded accuracy.
        """
        transfers = []

        task_ids = sorted(self.task_accuracies.keys())

        for task_b in task_ids[1:]:
            times_b = sorted(self.task_accuracies[task_b].keys())
            if len(times_b) > 1:
                initial_acc = self.task_accuracies[task_b][times_b[0]]
                best_acc = max(self.task_accuracies[task_b][t] for t in times_b)
                transfers.append(best_acc - initial_acc)

        return float(np.mean(transfers)) if transfers else 0.0

    def calculate_backward_transfer(self):
        """
        Backward Transfer: How does learning new tasks affect old tasks?

        Negative values indicate forgetting.

        Assumes the first recorded accuracy for a task was measured right
        after training on that task.
        """
        transfers = []

        task_ids = sorted(self.task_accuracies.keys())

        # The last task has no subsequent tasks, so it is skipped
        for task_a in task_ids[:-1]:
            times_a = sorted(self.task_accuracies[task_a].keys())

            if len(times_a) > 1:
                # Accuracy immediately after training on task A
                initial_acc = self.task_accuracies[task_a][times_a[0]]

                # Accuracy after training on all subsequent tasks
                final_acc = self.task_accuracies[task_a][times_a[-1]]

                transfers.append(final_acc - initial_acc)

        return float(np.mean(transfers)) if transfers else 0.0

    def calculate_forgetting_measure(self):
        """
        Forgetting Measure: drop from peak accuracy to final accuracy,
        averaged over all tasks with more than one evaluation.
        """
        forgetting_scores = []

        for task_id, time_accuracies in self.task_accuracies.items():
            times = sorted(time_accuracies.keys())

            if len(times) > 1:
                max_acc = max(time_accuracies[t] for t in times)
                final_acc = time_accuracies[times[-1]]

                forgetting_scores.append(max_acc - final_acc)

        return float(np.mean(forgetting_scores)) if forgetting_scores else 0.0

    def calculate_average_accuracy(self, final_only=False):
        """
        Average Accuracy across all tasks.
        """
        accuracies = []

        for task_id, time_accuracies in self.task_accuracies.items():
            if final_only:
                # Only use the most recent evaluation of each task
                final_time = max(time_accuracies.keys())
                accuracies.append(time_accuracies[final_time])
            else:
                # Average across all evaluations
                accuracies.extend(time_accuracies.values())

        return float(np.mean(accuracies)) if accuracies else 0.0

    def generate_report(self):
        """Generate comprehensive continual learning report."""
        report = {
            'average_accuracy': self.calculate_average_accuracy(final_only=True),
            'forward_transfer': self.calculate_forward_transfer(),
            'backward_transfer': self.calculate_backward_transfer(),
            'forgetting_measure': self.calculate_forgetting_measure()
        }

        print("Continual Learning Performance Report")
        print("=" * 50)
        print(f"Average Accuracy: {report['average_accuracy']:.4f}")
        print(f"Forward Transfer: {report['forward_transfer']:+.4f}")
        print(f"Backward Transfer: {report['backward_transfer']:+.4f}")
        print(f"Forgetting Measure: {report['forgetting_measure']:.4f}")
        print("=" * 50)

        if report['forgetting_measure'] < 0.05:
            print("✅ Minimal forgetting detected")
        elif report['forgetting_measure'] < 0.15:
            print("⚠️ Moderate forgetting - consider mitigation")
        else:
            print("❌ Severe forgetting - immediate action needed")

        return report
```
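The same quantities are often tracked in an accuracy matrix, where entry `R[i][j]` is the accuracy on task `j` measured right after training on task `i`. The sketch below uses only the standard library; the function names and the matrix values are invented for illustration and are not part of NTF.

```python
def average_accuracy(R):
    """Mean accuracy over all tasks after the final training stage."""
    final_row = R[-1]
    return sum(final_row) / len(final_row)


def backward_transfer(R):
    """Mean of (final accuracy - accuracy right after learning), last task excluded."""
    T = len(R)
    if T < 2:
        return 0.0
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def forgetting(R):
    """Mean of (best earlier accuracy - final accuracy), last task excluded."""
    T = len(R)
    if T < 2:
        return 0.0
    drops = []
    for j in range(T - 1):
        best_earlier = max(R[i][j] for i in range(j, T - 1))
        drops.append(best_earlier - R[T - 1][j])
    return sum(drops) / len(drops)


# Hypothetical results for three tasks learned in sequence.
# Row i = after training on task i; column j = accuracy on task j.
R = [
    [0.90, 0.10, 0.10],
    [0.80, 0.85, 0.12],
    [0.70, 0.80, 0.88],
]

avg_acc = average_accuracy(R)
bwt = backward_transfer(R)
fgt = forgetting(R)
print(f"average accuracy: {avg_acc:.3f}, backward transfer: {bwt:+.3f}, forgetting: {fgt:.3f}")
```

With these made-up numbers, backward transfer is negative and the forgetting score is positive, which is the pattern to watch for when old-task accuracy erodes.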

	---

	## Catastrophic Forgetting

	### Understanding the Problem

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt


def evaluate(model, data_loader):
    """Accuracy helper assumed by this demo (fraction of correct argmax predictions)."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for texts, labels in data_loader:
            predictions = model(texts).argmax(dim=-1)
            correct += (predictions == labels).sum().item()
            total += len(labels)
    return correct / total if total else 0.0


def demonstrate_catastrophic_forgetting(model, task1_data, task2_data):
    """
    Demonstrate catastrophic forgetting phenomenon.

    Train on Task 1, then Task 2, observe performance drop on Task 1.
    For brevity the same loaders are used for training and evaluation;
    use held-out data for a real measurement.
    """
    metrics = {
        'task1_before': [],
        'task1_after_task2': [],
        'task2_after': []
    }

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    # Phase 1: Train on Task 1
    print("Phase 1: Training on Task 1...")
    for epoch in range(10):
        model.train()
        for texts, labels in task1_data:
            optimizer.zero_grad()
            outputs = model(texts)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    # Evaluate on Task 1
    task1_acc = evaluate(model, task1_data)
    metrics['task1_before'].append(task1_acc)
    print(f"Task 1 Accuracy after Task 1 training: {task1_acc:.4f}")

    # Phase 2: Train on Task 2 (without seeing Task 1 data)
    print("\nPhase 2: Training on Task 2...")
    for epoch in range(10):
        model.train()
        for texts, labels in task2_data:
            optimizer.zero_grad()
            outputs = model(texts)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    # Evaluate on both tasks
    task1_acc_after = evaluate(model, task1_data)
    task2_acc = evaluate(model, task2_data)

    metrics['task1_after_task2'].append(task1_acc_after)
    metrics['task2_after'].append(task2_acc)

    print(f"\nTask 1 Accuracy after Task 2 training: {task1_acc_after:.4f}")
    print(f"Task 2 Accuracy: {task2_acc:.4f}")

    forgetting = task1_acc - task1_acc_after
    # Guard against division by zero when Task 1 accuracy is 0
    relative_drop = forgetting / task1_acc * 100 if task1_acc > 0 else 0.0
    print(f"\n📉 FORGETTING: {forgetting:.4f} ({relative_drop:.1f}% drop)")

    # Visualization
    fig, ax = plt.subplots(figsize=(10, 6))

    ax.bar(['Task 1\n(Before)', 'Task 1\n(After)', 'Task 2'],
           [task1_acc, task1_acc_after, task2_acc],
           color=['green', 'red', 'blue'], alpha=0.7)

    ax.set_ylabel('Accuracy')
    ax.set_title('Demonstration of Catastrophic Forgetting')
    ax.set_ylim(0, 1)

    for i, v in enumerate([task1_acc, task1_acc_after, task2_acc]):
        ax.text(i, v + 0.02, f'{v:.3f}', ha='center')

    plt.tight_layout()
    plt.savefig('catastrophic_forgetting_demo.png')
    plt.show()

    return metrics, forgetting
```

	### Why Does Forgetting Happen?

1. Weight Interference: Gradient updates for the new task move shared weights in directions that raise the old task's loss
2. Representation Shift: Hidden representations change to accommodate new patterns, so features the old task relied on drift
3. Decision Boundary Movement: Classification boundaries shift away from old class regions
4. Capacity Limits: The model has finite capacity, so new knowledge can displace old knowledge
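Weight interference can be seen even with a single weight. The toy sketch below assumes two quadratic losses whose optima sit at opposite values (both optima are made up for the example); training on the second task with plain gradient descent drags the weight away from the first task's optimum.

```python
def grad_descent(theta, target, steps=200, lr=0.1):
    """Minimize the quadratic loss (theta - target)^2 with plain gradient descent."""
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - target)
    return theta


def loss(theta, target):
    """Quadratic loss of a single weight against a task-specific optimum."""
    return (theta - target) ** 2


TASK1_OPTIMUM = 1.0   # hypothetical best weight for task 1
TASK2_OPTIMUM = -1.0  # hypothetical best weight for task 2

# Phase 1: learn task 1
theta = grad_descent(0.0, TASK1_OPTIMUM)
task1_loss_before = loss(theta, TASK1_OPTIMUM)

# Phase 2: learn task 2 without ever revisiting task 1
theta = grad_descent(theta, TASK2_OPTIMUM)
task1_loss_after = loss(theta, TASK1_OPTIMUM)

print(f"task 1 loss before task 2: {task1_loss_before:.6f}")
print(f"task 1 loss after task 2:  {task1_loss_after:.6f}")
```

The task 1 loss goes from essentially zero to roughly the squared distance between the two optima, because nothing in the second phase rewards staying near the first solution.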

	---

	## Replay-Based Methods

	### Experience Replay

```python
import pickle
import random
from collections import deque

import torch
import torch.nn as nn


class ExperienceReplayBuffer:
    """
    Store and sample past experiences for replay during training.

    Every stored item is a (sample, label) tuple, regardless of strategy.
    """

    def __init__(self, max_size=10000, strategy='uniform'):
        """
        Args:
            max_size: Maximum number of samples to store
            strategy: 'uniform' (FIFO window), 'reservoir', 'class_balanced'
        """
        self.max_size = max_size
        self.strategy = strategy
        self.buffer = deque(maxlen=max_size)
        self.class_buffers = {}  # For class-balanced sampling
        self.n_seen = 0  # Stream position, used by reservoir sampling

    def add(self, sample, label=None):
        """Add a sample to the replay buffer."""
        self.n_seen += 1
        if self.strategy == 'class_balanced' and label is not None:
            if label not in self.class_buffers:
                self.class_buffers[label] = deque(maxlen=max(1, self.max_size // 10))
            self.class_buffers[label].append((sample, label))
        elif self.strategy == 'reservoir' and len(self.buffer) >= self.max_size:
            # Reservoir sampling: keep the new item with probability max_size / n_seen
            slot = random.randrange(self.n_seen)
            if slot < self.max_size:
                self.buffer[slot] = (sample, label)
        else:
            # 'uniform': the deque drops the oldest item once it is full
            self.buffer.append((sample, label))

    def sample(self, batch_size):
        """Sample a batch of (sample, label) tuples from the replay buffer."""
        if self.strategy == 'class_balanced':
            # Sample (approximately) equally from each non-empty class
            non_empty = [buf for buf in self.class_buffers.values() if len(buf) > 0]
            all_samples = []
            if non_empty:
                per_class = max(1, batch_size // len(non_empty))
                for class_buf in non_empty:
                    n_samples = min(per_class, len(class_buf))
                    all_samples.extend(random.sample(list(class_buf), n_samples))

            # Pad with unlabeled items if necessary
            unlabeled = list(self.buffer)
            while len(all_samples) < batch_size and unlabeled:
                all_samples.append(random.choice(unlabeled))

            return random.sample(all_samples, min(batch_size, len(all_samples)))

        # 'uniform' and 'reservoir' differ in add(); both draw uniformly here
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        # Count class-balanced items too, so the buffer is never reported as empty
        return len(self.buffer) + sum(len(buf) for buf in self.class_buffers.values())

    def save(self, path):
        """Save replay buffer to disk."""
        with open(path, 'wb') as f:
            pickle.dump({
                'buffer': list(self.buffer),
                'class_buffers': {k: list(v) for k, v in self.class_buffers.items()},
                'strategy': self.strategy,
                'n_seen': self.n_seen
            }, f)

    def load(self, path):
        """Load replay buffer from disk (only unpickle files you trust)."""
        with open(path, 'rb') as f:
            data = pickle.load(f)
        self.buffer = deque(data['buffer'], maxlen=self.max_size)
        self.class_buffers = {
            k: deque(v, maxlen=max(1, self.max_size // 10))
            for k, v in data['class_buffers'].items()
        }
        self.strategy = data['strategy']
        self.n_seen = data.get('n_seen', len(self.buffer))


class ReplayBasedTrainer:
    """
    Trainer with experience replay for continual learning.

    Batches are assumed to be dicts with 'texts' (a list) and 'labels'
    (a list or 1-D tensor of integer class ids).
    """

    def __init__(self, model, replay_buffer, config):
        self.model = model
        self.replay_buffer = replay_buffer
        self.config = config
        self.optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)
        self.criterion = nn.CrossEntropyLoss()

    def train_step(self, current_batch, replay_batch_size=32):
        """
        Train on current data mixed with replayed examples.
        """
        self.model.train()

        combined_texts = list(current_batch['texts'])
        combined_labels = [int(label) for label in current_batch['labels']]

        # Mix in replayed examples, if any are stored
        if len(self.replay_buffer) > 0:
            replay_samples = self.replay_buffer.sample(replay_batch_size)
            combined_texts += [s[0] for s in replay_samples]
            combined_labels += [int(s[1]) for s in replay_samples]

        # Train on combined batch
        self.optimizer.zero_grad()
        outputs = self.model(combined_texts)
        targets = torch.as_tensor(combined_labels, device=outputs.device)
        loss = self.criterion(outputs, targets)
        loss.backward()
        self.optimizer.step()

        return loss.item()

    def train_on_task(self, task_data_loader, task_id, epochs=5):
        """Train on a new task while replaying old experiences."""
        print(f"\nTraining on Task {task_id} with replay...")

        for epoch in range(epochs):
            total_loss = 0

            for batch in task_data_loader:
                loss = self.train_step(batch, replay_batch_size=32)
                total_loss += loss

                # Store each example once (first epoch only) to avoid duplicates
                if epoch == 0:
                    for text, label in zip(batch['texts'], batch['labels']):
                        self.replay_buffer.add(text, label=int(label))

            avg_loss = total_loss / max(1, len(task_data_loader))
            print(f"Epoch {epoch + 1}/{epochs}, Loss: {avg_loss:.4f}")

        # Save replay buffer checkpoint
        self.replay_buffer.save(f'replay_buffer_task{task_id}.pkl')
```
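The choice of buffer policy matters more than it may appear. A bounded `deque` keeps only the most recent items, so once a new task has produced more examples than the buffer holds, nothing from the earlier task is left to replay. Reservoir sampling instead keeps a uniform random subset of the whole stream. The sketch below compares the two on a made-up stream of two tasks; the helper names are illustrative only.

```python
import random
from collections import deque


def fill_fifo(stream, capacity):
    """Bounded FIFO: the oldest item is dropped whenever the buffer is full."""
    buf = deque(maxlen=capacity)
    for item in stream:
        buf.append(item)
    return list(buf)


def fill_reservoir(stream, capacity, rng):
    """Reservoir sampling: every stream item ends up stored with equal probability."""
    buf = []
    for n_seen, item in enumerate(stream, start=1):
        if len(buf) < capacity:
            buf.append(item)
        else:
            slot = rng.randrange(n_seen)  # uniform in [0, n_seen)
            if slot < capacity:
                buf[slot] = item
    return buf


# Hypothetical stream: 500 examples from task "A" followed by 500 from task "B"
stream = [("A", i) for i in range(500)] + [("B", i) for i in range(500)]

rng = random.Random(0)
fifo = fill_fifo(stream, capacity=100)
reservoir = fill_reservoir(stream, capacity=100, rng=rng)

fifo_task_a = sum(1 for task, _ in fifo if task == "A")
reservoir_task_a = sum(1 for task, _ in reservoir if task == "A")

print(f"FIFO buffer keeps {fifo_task_a} task-A examples")
print(f"Reservoir buffer keeps {reservoir_task_a} task-A examples")
```

The FIFO buffer ends up with no task-A examples at all, while the reservoir holds roughly half from each task, which is what replay needs in order to protect the older task.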

	### Generative Replay

```python
import torch
import torch.nn as nn


class GenerativeReplay:
    """
    Use a generative model to recreate past data instead of storing it.

    Benefits:
    - Privacy: Don't store actual past data
    - Compression: Generate many variations from compact model
    - Scalability: No memory limit on "stored" experiences

    The generator is assumed to be an autoencoder-style nn.Module that
    reconstructs its input when called and exposes a decode(latent) method.
    """

    def __init__(self, generator_model, generator_config):
        self.generator = generator_model
        self.config = generator_config

    def train_generator(self, data_loader, epochs=10):
        """Train generator on current task data."""
        self.generator.train()
        optimizer = torch.optim.Adam(self.generator.parameters(), lr=1e-4)

        for epoch in range(epochs):
            total_loss = 0

            for batch in data_loader:
                optimizer.zero_grad()

                # Train generator to reconstruct input data
                # (batch['texts'] must already be a tensor, e.g. embeddings)
                reconstructed = self.generator(batch['texts'])
                loss = self.reconstruction_loss(batch['texts'], reconstructed)

                loss.backward()
                optimizer.step()
                total_loss += loss.item()

            print(f"Generator Epoch {epoch + 1}/{epochs}, Loss: {total_loss/len(data_loader):.4f}")

    def generate_replay_samples(self, n_samples=100):
        """Generate synthetic samples resembling past data."""
        self.generator.eval()

        generated_samples = []

        with torch.no_grad():
            for _ in range(n_samples):
                # Sample from latent space
                latent = torch.randn(1, self.config.latent_dim)

                # Generate text/features
                generated = self.generator.decode(latent)
                generated_samples.append(generated)

        return generated_samples

    def reconstruction_loss(self, original, reconstructed):
        """Calculate reconstruction loss for generator training."""
        # Implementation depends on generator architecture
        # Could be MSE, cross-entropy, or perceptual loss
        # MSE only makes sense for continuous tensors (not raw strings)
        return nn.MSELoss()(reconstructed, original)
```

	### Dark Experience Replay

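```python
from collections import deque

import torch
import torch.nn as nn


class DarkExperienceReplay:
    """
    Store model outputs along with inputs for replay.

    Instead of storing (x, y), store (x, model_output_at_time_t).
    This preserves the model's learned behavior, not just labels.

    Note: this variant stores softmax probabilities and matches them with
    KL divergence; the published DER method matches raw logits with an
    MSE penalty instead.
    """

    def __init__(self, model, max_size=5000):
        self.model = model
        self.max_size = max_size
        self.memory = deque(maxlen=max_size)
        self.n_collected = 0  # Monotonic counter used as a timestamp

    def collect_experience(self, texts, labels):
        """Collect experiences with model predictions."""
        self.model.eval()

        with torch.no_grad():
            # get_logits is assumed to exist on the wrapped model
            logits = self.model.get_logits(texts)
            probabilities = torch.softmax(logits, dim=-1)

        # Store input, true label, and model's predicted distribution
        for text, label, prob_dist in zip(texts, labels, probabilities):
            experience = {
                'text': text,
                'true_label': label,
                'predicted_distribution': prob_dist.cpu().numpy(),
                'timestamp': self.n_collected
            }
            self.memory.append(experience)
            self.n_collected += 1

    def replay_loss(self, current_logits, stored_experiences):
        """
        Calculate distillation loss to match old predictions.

        Args:
            current_logits: mapping from each stored input text to the
                current model's 1-D logits tensor for that input
            stored_experiences: experiences sampled from self.memory
        """
        if not stored_experiences:
            return torch.tensor(0.0)

        # 'sum' gives the KL divergence of one example; on a 1-D tensor
        # 'batchmean' would wrongly divide by the number of classes
        kl = nn.KLDivLoss(reduction='sum')
        total_loss = 0

        for exp in stored_experiences:
            # Get current prediction for stored input
            current_pred = current_logits[exp['text']]

            # Stored prediction (from old model)
            old_pred = torch.as_tensor(
                exp['predicted_distribution'], device=current_pred.device
            )

            # KL divergence to match old predictions
            total_loss += kl(torch.log_softmax(current_pred, dim=-1), old_pred)

        return total_loss / len(stored_experiences)
```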

	---

	## Regularization-Based Methods

	### Elastic Weight Consolidation (EWC)

```python
import torch
import torch.nn.functional as F


class EWC:
    """
    Elastic Weight Consolidation: Penalize changes to important weights.

    Key idea: Some weights are more important for previous tasks.
    Constrain important weights to stay close to their old values.
    """

    def __init__(self, model, fisher_diagonal=None):
        self.model = model
        self.fisher_diagonal = fisher_diagonal  # Importance weights
        self.optimal_weights = None  # Weights after previous task

    def estimate_fisher_information(self, data_loader, device='cpu'):
        """
        Estimate Fisher Information Matrix diagonal.

        Fisher Information measures how sensitive the loss is to each parameter.
        High Fisher = parameter is important, should not change much.

        This is an empirical approximation built from squared batch gradients.
        `device` is unused here; batches are assumed to already be on the
        model's device.
        """
        self.model.eval()

        # Initialize Fisher as zeros
        fisher = {
            name: torch.zeros_like(param)
            for name, param in self.model.named_parameters()
            if param.requires_grad
        }

        # Accumulate squared gradients
        n_batches = 0
        for batch in data_loader:
            self.model.zero_grad()

            # Get predictions
            outputs = self.model(batch['texts'])
            loss = F.cross_entropy(outputs, batch['labels'])

            # Compute gradients
            loss.backward()

            # Square and accumulate gradients (Fisher diagonal approximation)
            for name, param in self.model.named_parameters():
                if name in fisher and param.grad is not None:
                    fisher[name] += param.grad.detach().pow(2)

            n_batches += 1

        # Average over batches: each term is the squared gradient of a
        # batch-mean loss, so dividing by the dataset size would under-scale it
        for name in fisher:
            fisher[name] /= max(n_batches, 1)

        self.fisher_diagonal = fisher

        # Store optimal weights (current weights after training)
        self.optimal_weights = {
            name: param.clone().detach()
            for name, param in self.model.named_parameters()
            if param.requires_grad
        }

        return fisher

    def ewc_loss(self, lambda_ewc=1000):
        """
        Calculate EWC regularization loss.

        L_ewc = λ * Σ_i F_i * (θ_i - θ*_i)^2

        where:
        - λ is the regularization strength (lambda_ewc)
        - F_i is Fisher Information for parameter i
        - θ_i is current parameter value
        - θ*_i is optimal parameter value from previous task
        """
        if self.fisher_diagonal is None or self.optimal_weights is None:
            return torch.tensor(0.0)

        # Start from a tensor so the result always supports .item()
        ewc_loss = torch.tensor(0.0)

        for name, param in self.model.named_parameters():
            if param.requires_grad and name in self.fisher_diagonal:
                # Squared distance from optimal weights, weighted by Fisher
                diff = param - self.optimal_weights[name]
                ewc_loss = ewc_loss + (self.fisher_diagonal[name] * diff.pow(2)).sum()

        return lambda_ewc * ewc_loss

    def save_checkpoint(self, path):
        """Save EWC state (Fisher and optimal weights)."""
        torch.save({
            'fisher_diagonal': self.fisher_diagonal,
            'optimal_weights': self.optimal_weights
        }, path)

    def load_checkpoint(self, path):
        """Load EWC state."""
        checkpoint = torch.load(path)
        self.fisher_diagonal = checkpoint['fisher_diagonal']
        self.optimal_weights = checkpoint['optimal_weights']


class EWC_Trainer:
    """Trainer with EWC regularization for continual learning."""

    def __init__(self, model, config, lambda_ewc=1000):
        self.model = model
        self.config = config
        self.lambda_ewc = lambda_ewc
        self.ewc = EWC(model)
        self.optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)

    def train_on_task(self, task_data_loader, task_id, epochs=5):
        """Train on new task with EWC regularization."""
        print(f"\nTraining Task {task_id} with EWC...")

        for epoch in range(epochs):
            total_loss = 0
            total_ewc_loss = 0

            for batch in task_data_loader:
                self.model.train()
                self.optimizer.zero_grad()

                # Standard task loss
                outputs = self.model(batch['texts'])
                task_loss = F.cross_entropy(outputs, batch['labels'])

                # EWC regularization loss (zero until a first task is consolidated)
                ewc_loss = self.ewc.ewc_loss(self.lambda_ewc)

                # Total loss
                total = task_loss + ewc_loss

                total.backward()
                self.optimizer.step()

                total_loss += task_loss.item()
                total_ewc_loss += ewc_loss.item()

            avg_loss = total_loss / len(task_data_loader)
            avg_ewc = total_ewc_loss / len(task_data_loader)

            print(f"Epoch {epoch + 1}/{epochs}:")
            print(f"  Task Loss: {avg_loss:.4f}")
            print(f"  EWC Loss: {avg_ewc:.4f}")

        # After training, update Fisher and optimal weights.
        # Note: this overwrites the previous estimate, so only the most
        # recent task is protected unless Fisher values are accumulated.
        print("Updating Fisher Information Matrix...")
        self.ewc.estimate_fisher_information(task_data_loader)
        self.ewc.save_checkpoint(f'ewc_checkpoint_task{task_id}.pth')
```
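To see what the penalty does, it helps to work a two-parameter case by hand. The sketch below assumes a quadratic new-task loss `(θ_i - b_i)^2` per parameter, for which the penalized minimum has the closed form `(b_i + λ F_i θ*_i) / (1 + λ F_i)`. All numbers are invented for illustration.

```python
def ewc_penalty(theta, theta_star, fisher, lam):
    """lam * sum_i F_i * (theta_i - theta*_i)^2, as in the EWC loss."""
    return lam * sum(f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher))


def ewc_minimizer(new_optimum, theta_star, fisher, lam):
    """Per-parameter minimizer of (theta - b)^2 + lam * F * (theta - theta*)^2."""
    return [
        (b + lam * f * s) / (1.0 + lam * f)
        for b, s, f in zip(new_optimum, theta_star, fisher)
    ]


theta_star = [1.0, 1.0]    # weights after the old task
fisher = [10.0, 0.001]     # first weight matters to the old task, second barely does
new_optimum = [0.0, 0.0]   # where the new task alone would pull both weights
lam = 1.0

theta_new = ewc_minimizer(new_optimum, theta_star, fisher, lam)
penalty_at_new_optimum = ewc_penalty(new_optimum, theta_star, fisher, lam)

print(f"weights after the penalized update: {theta_new}")
print(f"penalty for jumping straight to the new optimum: {penalty_at_new_optimum:.3f}")
```

The high-Fisher weight stays close to its old value while the low-Fisher weight moves almost all the way to the new optimum, which is the selective plasticity EWC is designed to provide.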

	### Synaptic Intelligence (SI)

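```python
import torch


class SynapticIntelligence:
    """
    Synaptic Intelligence: Track parameter importance during training.

    Unlike EWC which computes Fisher after training,
    SI accumulates importance online during training.

    Intended call order: update_importance() and then update_previous_params()
    after every optimizer step; consolidate_task() once at the end of a task.
    """

    def __init__(self, model, epsilon=1e-3):
        self.model = model
        self.epsilon = epsilon  # Damping term that avoids division by zero
        # Consolidated importance, used by the penalty
        self.importance = {
            name: torch.zeros_like(param)
            for name, param in model.named_parameters()
            if param.requires_grad
        }
        # Parameters before the most recent optimizer step
        self.previous_params = {
            name: param.clone().detach()
            for name, param in model.named_parameters()
            if param.requires_grad
        }
        # Parameters at the end of the previous task (anchor for the penalty)
        self.anchor_params = {
            name: value.clone() for name, value in self.previous_params.items()
        }
        # Running per-task sum of -grad * delta_param
        self.delta_loss = {
            name: torch.zeros_like(value) for name, value in self.importance.items()
        }

    def update_importance(self):
        """
        Accumulate each parameter's contribution to the loss decrease.

        Call after optimizer.step(), while param.grad still holds the
        gradient that produced the step.
        """
        for name, param in self.model.named_parameters():
            if name in self.delta_loss and param.grad is not None:
                # Change in parameter caused by the last step
                delta_param = param.detach() - self.previous_params[name]

                # First-order contribution to the loss decrease
                self.delta_loss[name] += -param.grad.detach() * delta_param

    def consolidate_task(self):
        """Fold the running contributions into importance at the end of a task."""
        for name, param in self.model.named_parameters():
            if name in self.importance:
                total_change = param.detach() - self.anchor_params[name]
                # Clamp at zero so the penalty weight stays non-negative
                self.importance[name] += (
                    torch.relu(self.delta_loss[name])
                    / (total_change.pow(2) + self.epsilon)
                )
                self.delta_loss[name].zero_()
                self.anchor_params[name] = param.clone().detach()

    def si_loss(self, c_si=100):
        """
        Calculate SI regularization loss.

        L_si = c * Σ_j Ω_j * (θ_j - θ*_j)^2

        where Ω_j is accumulated importance for parameter j and θ*_j is its
        value at the end of the previous task.
        """
        si_loss = torch.tensor(0.0)

        for name, param in self.model.named_parameters():
            if param.requires_grad and name in self.importance:
                diff = param - self.anchor_params[name]
                si_loss = si_loss + (self.importance[name] * diff.pow(2)).sum()

        return c_si * si_loss

    def update_previous_params(self):
        """Store current parameters as previous after training step."""
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.previous_params[name] = param.clone().detach()
```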
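The quantity SI accumulates is a first-order estimate of how much each parameter's movement reduced the loss: the running sum of `-gradient * parameter_change` over training steps. The one-parameter sketch below, using a made-up quadratic loss, shows that this sum tracks the actual loss decrease closely when the learning rate is small.

```python
def loss(theta):
    """Toy loss with its minimum at theta = 0."""
    return theta ** 2


def grad(theta):
    """Exact gradient of the toy loss."""
    return 2.0 * theta


theta = 3.0   # hypothetical starting weight
lr = 0.01
path_integral = 0.0
initial_loss = loss(theta)

for _ in range(500):
    g = grad(theta)
    new_theta = theta - lr * g
    # Contribution of this step: -gradient * change in parameter
    path_integral += -g * (new_theta - theta)
    theta = new_theta

loss_drop = initial_loss - loss(theta)

print(f"accumulated -g * delta: {path_integral:.4f}")
print(f"actual loss decrease:   {loss_drop:.4f}")
```

The two numbers agree to within about one percent here; the small gap comes from the second-order term that the first-order sum ignores, and it shrinks as the step size shrinks.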

	### Learning without Forgetting (LwF)

```python
import copy

import torch
import torch.nn.functional as F


class LearningWithoutForgetting:
    """
    Learning without Forgetting: Use knowledge distillation from old model.

    Keep a copy of the old model and distill its knowledge
    while training on new data.
    """

    def __init__(self, model):
        self.model = model
        self.old_model = None  # Copy of model from previous task

    def create_old_model_copy(self):
        """Create a frozen copy of current model before training on new task."""
        self.old_model = copy.deepcopy(self.model)

        # Freeze old model
        for param in self.old_model.parameters():
            param.requires_grad = False

        self.old_model.eval()

    def distillation_loss(self, texts, temperature=2.0):
        """
        Calculate distillation loss to match old model's outputs.

        Uses softened probability distributions (with temperature)
        to capture dark knowledge.
        """
        if self.old_model is None:
            return torch.tensor(0.0)

        self.model.train()
        self.old_model.eval()

        with torch.no_grad():
            # Old model's soft predictions (get_logits is assumed to exist)
            old_logits = self.old_model.get_logits(texts)
            old_soft_probs = F.softmax(old_logits / temperature, dim=-1)

        # Current model's predictions (log-probabilities, as F.kl_div expects)
        current_logits = self.model.get_logits(texts)
        current_soft_probs = F.log_softmax(current_logits / temperature, dim=-1)

        # KL divergence between distributions, rescaled by T^2 so gradient
        # magnitudes stay comparable across temperatures
        dist_loss = F.kl_div(
            current_soft_probs,
            old_soft_probs,
            reduction='batchmean'
        ) * (temperature ** 2)

        return dist_loss

    def combined_loss(self, task_loss, dist_loss, alpha=0.5):
        """
        Combine task loss and distillation loss.

        L_total = α * L_task + (1-α) * L_distill
        """
        return alpha * task_loss + (1 - alpha) * dist_loss
```
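The effect of the temperature and of the KL term can be checked with plain Python. The sketch below is a standard-library illustration of temperature-softened softmax and a distillation loss of the form `T^2 * KL(old || new)`; the logits are made up and the helper names are not part of any library.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(old_logits, new_logits, temperature=2.0):
    """T^2 * KL between the softened old and new predictions."""
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    return kl_divergence(p_old, p_new) * temperature ** 2


old_logits = [4.0, 1.0, 0.0]      # hypothetical outputs of the frozen old model
drifted_logits = [1.0, 4.0, 0.0]  # hypothetical outputs after training on a new task

sharp = softmax(old_logits, temperature=1.0)
soft = softmax(old_logits, temperature=2.0)
loss_same = distillation_loss(old_logits, old_logits)
loss_drifted = distillation_loss(old_logits, drifted_logits)

print(f"top-class probability at T=1: {sharp[0]:.3f}, at T=2: {soft[0]:.3f}")
print(f"distillation loss, unchanged model: {loss_same:.4f}")
print(f"distillation loss, drifted model:   {loss_drifted:.4f}")
```

A higher temperature spreads probability onto the non-top classes, so the loss also penalizes changes in how the old model ranked the alternatives, not only changes in its top prediction.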

	---

	## Architecture-Based Methods

	### Progressive Neural Networks

```python
import torch.nn as nn


class ProgressiveNeuralNetwork(nn.Module):
    """
    Progressive Neural Networks: Add new columns for new tasks.

    Each task gets its own neural network column.
    Columns can read from previous columns but not modify them.
    """

    def __init__(self, base_column_class, config):
        super().__init__()
        self.base_column_class = base_column_class
        self.config = config
        self.columns = nn.ModuleList()
        self.task_to_column = {}

    def add_task_column(self, task_id):
        """Add a new column for a new task."""
        # Create new column
        new_column = self.base_column_class(self.config)

        # If there are previous columns, add lateral connections
        # (add_lateral_connections is assumed to be provided by the column class)
        if len(self.columns) > 0:
            new_column.add_lateral_connections(self.columns)

        self.columns.append(new_column)
        self.task_to_column[task_id] = len(self.columns) - 1

        # Freeze all previous columns
        for col_idx in range(len(self.columns) - 1):
            for param in self.columns[col_idx].parameters():
                param.requires_grad = False

        print(f"Added column {len(self.columns) - 1} for task {task_id}")

    def forward(self, x, task_id):
        """Forward pass through appropriate column."""
        column_idx = self.task_to_column[task_id]
        return self.columns[column_idx](x)

    def get_total_parameters(self):
        """Count total parameters across all columns."""
        return sum(p.numel() for col in self.columns for p in col.parameters())
```

	### Adapter Modules

```python
import torch.nn as nn


class AdapterModule(nn.Module):
    """
    Adapter modules: Small trainable modules inserted into frozen backbone.

    Keep pretrained model frozen, only train lightweight adapters.
    Different adapters for different tasks.
    """

    def __init__(self, hidden_dim, adapter_dim=64):
        super().__init__()
        self.down_project = nn.Linear(hidden_dim, adapter_dim)
        self.activation = nn.GELU()
        self.up_project = nn.Linear(adapter_dim, hidden_dim)
        self.layer_norm = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        # Residual connection
        residual = x

        # Adapter bottleneck
        x = self.down_project(x)
        x = self.activation(x)
        x = self.up_project(x)

        # Add residual and normalize
        x = self.layer_norm(x + residual)

        return x


class AdapterConfig:
    """Configuration for adapter-based fine-tuning."""

    def __init__(self,
                 adapter_dim=64,
                 adapter_locations='all_layers',
                 freeze_backbone=True,
                 task_adapters=True):
        self.adapter_dim = adapter_dim
        # 'all_layers', or 'last_<k>' with an integer k (for example 'last_4')
        self.adapter_locations = adapter_locations
        self.freeze_backbone = freeze_backbone
        self.task_adapters = task_adapters  # Separate adapter per task


def insert_adapters_into_transformer(transformer_model, config):
    """
    Insert adapter modules into a transformer model.

    The model is assumed to expose `layers`, and each layer is assumed to
    expose `hidden_dim` and an `insert_adapter(adapter)` method.
    """

    if config.freeze_backbone:
        # Freeze all backbone parameters; adapters created below stay trainable
        for param in transformer_model.parameters():
            param.requires_grad = False

    adapters = {}
    n_layers = len(transformer_model.layers)

    # Insert adapters into each selected layer
    for layer_idx, layer in enumerate(transformer_model.layers):
        if config.adapter_locations == 'all_layers' or \
                (config.adapter_locations.startswith('last_') and
                 layer_idx >= n_layers - int(config.adapter_locations[5:])):

            # Create adapter
            adapter = AdapterModule(
                hidden_dim=layer.hidden_dim,
                adapter_dim=config.adapter_dim
            )

            # Insert after attention sublayer
            layer.insert_adapter(adapter)

            adapters[f'layer_{layer_idx}'] = adapter

    return adapters
```
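A quick parameter count shows why adapters are considered lightweight. The sketch below counts the weights of one bottleneck adapter with the same structure as the module above (down-projection, up-projection, layer norm); the 12-layer, 768-wide backbone with about 110M parameters is an assumed example size, not a property of any specific model.

```python
def adapter_parameter_count(hidden_dim, adapter_dim):
    """Trainable parameters in one bottleneck adapter with a layer norm."""
    down = hidden_dim * adapter_dim + adapter_dim   # weight + bias
    up = adapter_dim * hidden_dim + hidden_dim      # weight + bias
    layer_norm = 2 * hidden_dim                     # scale + shift
    return down + up + layer_norm


# Assumed backbone for illustration: 12 layers, hidden size 768, ~110M parameters
N_LAYERS = 12
BACKBONE_PARAMS = 110_000_000

per_adapter = adapter_parameter_count(hidden_dim=768, adapter_dim=64)
total_adapters = N_LAYERS * per_adapter
fraction = total_adapters / BACKBONE_PARAMS

print(f"parameters per adapter: {per_adapter:,}")
print(f"parameters for {N_LAYERS} adapters: {total_adapters:,}")
print(f"fraction of backbone size: {fraction:.2%}")
```

Under these assumptions, one adapter per layer adds roughly one percent of the backbone's parameter count per task, which is what makes keeping a separate adapter set for each task affordable.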

	### Dynamic Architecture Expansion

```python
import torch.nn as nn


class DynamicArchitectureExpansion:
    """
    Dynamically expand model capacity when needed.

    Monitor performance; if degradation detected, add capacity.
    """

    def __init__(self, model, expansion_threshold=0.05):
        self.model = model
        self.expansion_threshold = expansion_threshold
        self.baseline_performance = None
        self.expansion_history = []

    def evaluate(self, model, validation_data):
        """Return a scalar score (e.g. accuracy); must be supplied by the user."""
        raise NotImplementedError("Provide an evaluation routine for your task")

    def monitor_and_expand(self, validation_data, current_task_id):
        """
        Check if model needs expansion based on performance drop.
        """
        current_performance = self.evaluate(self.model, validation_data)

        if self.baseline_performance is not None:
            performance_drop = self.baseline_performance - current_performance

            if performance_drop > self.expansion_threshold:
                print(f"Performance drop detected: {performance_drop:.4f}")
                print("Expanding model architecture...")

                self.expand_architecture(current_task_id)

                # Re-evaluate after expansion (a new head is untrained,
                # so expect a change only after further training)
                new_performance = self.evaluate(self.model, validation_data)
                print(f"Performance after expansion: {new_performance:.4f}")

        # The baseline tracks the most recent measurement, so each call
        # compares against the previous check
        self.baseline_performance = current_performance

    def expand_architecture(self, task_id):
        """Add new capacity to the model."""
        # Strategy 1: Add new neurons to hidden layers
        # Strategy 2: Add new layers
        # Strategy 3: Add task-specific heads

        # Example: Add task-specific output head
        # (hidden_dim and num_new_classes are assumed model attributes)
        new_head = nn.Linear(self.model.hidden_dim, self.model.num_new_classes)
        setattr(self.model, f'task_{task_id}_head', new_head)

        self.expansion_history.append({
            'task_id': task_id,
            'expansion_type': 'new_head',
            'timestamp': len(self.expansion_history)
        })
```

	---

	## Incremental Training Strategies

	### Scheduled Fine-Tuning

```python
import torch


class ScheduledFineTuner:
    """
    Schedule fine-tuning with decreasing learning rates and selective unfreezing.
    """

    def __init__(self, model, config):
        self.model = model
        self.config = config
        self.training_history = []
        # Frozen parameters receive no gradient, so the optimizer skips them
        self.optimizer = torch.optim.Adam(model.parameters(), lr=config.base_lr)

    def progressive_unfreezing(self, n_stages=3):
        """
        Build a schedule that unfreezes layers from top to bottom.

        Stage 1: Only train top layers
        Stage 2: Train top + middle layers
        Stage 3: Train all layers
        """
        total_layers = len(self.model.layers)

        schedule = []

        for stage in range(n_stages):
            # Number of layers to unfreeze; the last stage always covers all layers
            n_unfrozen = total_layers * (stage + 1) // n_stages

            # Learning rate decreases with deeper unfreezing
            lr = self.config.base_lr * (0.5 ** stage)

            schedule.append({
                'stage': stage,
                'n_unfrozen': n_unfrozen,
                'learning_rate': lr
            })

        return schedule

    def apply_unfreezing(self, n_unfrozen):
        """Freeze every layer except the top n_unfrozen ones."""
        total_layers = len(self.model.layers)

        for i, layer in enumerate(self.model.layers):
            trainable = i >= total_layers - n_unfrozen
            for param in layer.parameters():
                param.requires_grad = trainable

    def train_with_schedule(self, data_loader, schedule):
        """Train using progressive unfreezing schedule."""
        for stage_config in schedule:
            print(f"\nStage {stage_config['stage'] + 1}: "
                  f"Unfreezing {stage_config['n_unfrozen']} layers")

            # Apply the freeze/unfreeze pattern for this stage
            self.apply_unfreezing(stage_config['n_unfrozen'])

            # Set learning rate
            for param_group in self.optimizer.param_groups:
                param_group['lr'] = stage_config['learning_rate']

            # Train for this stage
            self.train_epoch(data_loader)

            # Record history
            self.training_history.append(stage_config)

    def train_epoch(self, data_loader):
        """One pass over the data; supply the task-specific training loop."""
        raise NotImplementedError
```
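It can help to print a concrete schedule before training. The sketch below is a standalone illustration that assumes the learning rate is halved at each stage and that layers are split as evenly as integer division allows; both are example choices, and the function name is hypothetical.

```python
def unfreezing_schedule(total_layers, n_stages, base_lr):
    """Return one dict per stage describing which layers train and at what rate."""
    schedule = []
    for stage in range(n_stages):
        # The final stage always unfreezes every layer
        n_unfrozen = total_layers * (stage + 1) // n_stages
        schedule.append({
            'stage': stage,
            'n_unfrozen': n_unfrozen,
            # Layers with index >= this value are trainable (top of the stack)
            'first_trainable_layer': total_layers - n_unfrozen,
            'learning_rate': base_lr * 0.5 ** stage,
        })
    return schedule


schedule = unfreezing_schedule(total_layers=12, n_stages=3, base_lr=1e-4)

for entry in schedule:
    print(
        f"stage {entry['stage'] + 1}: train layers "
        f"{entry['first_trainable_layer']}..11 at lr={entry['learning_rate']:.2e}"
    )
```

For a 12-layer model this trains the top 4, then the top 8, then all 12 layers, with the rate shrinking as more of the pretrained stack becomes trainable.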

	### Curriculum Learning for Continual Learning

```python
import torch


class CurriculumContinualLearner:
    """
    Apply curriculum learning principles to continual learning.

    Order tasks/examples from easy to hard to facilitate transfer.
    """

    def __init__(self, model):
        self.model = model
        self.task_difficulty = {}

    def estimate_task_difficulty(self, task_data):
        """
        Estimate difficulty of a task based on initial performance.
        """
        self.model.eval()

        correct = 0
        total = 0

        with torch.no_grad():
            for batch in task_data:
                outputs = self.model(batch['texts'])
                predictions = outputs.argmax(dim=-1)

                correct += (predictions == batch['labels']).sum().item()
                total += len(batch['labels'])

        # Treat an empty task as maximally difficult rather than dividing by zero
        initial_accuracy = correct / total if total else 0.0

        # Difficulty inversely related to accuracy
        difficulty = 1.0 - initial_accuracy

        return difficulty

    def order_tasks_by_curriculum(self, tasks):
        """
        Order tasks from easiest to hardest.

        Args:
            tasks: dict mapping task_id to an iterable of batches
        """
        # Estimate difficulty for each task
        for task_id, task_data in tasks.items():
            self.task_difficulty[task_id] = self.estimate_task_difficulty(task_data)

        # Sort by difficulty (easy to hard)
        ordered_tasks = sorted(
            tasks.items(),
            key=lambda x: self.task_difficulty[x[0]]
        )

        print("Task Curriculum (Easy → Hard):")
        for task_id, _ in ordered_tasks:
            print(f"  {task_id}: difficulty = {self.task_difficulty[task_id]:.3f}")

        return ordered_tasks

    def train_with_curriculum(self, ordered_tasks):
        """Train on tasks in curriculum order."""
        for task_id, task_data in ordered_tasks:
            print(f"\n{'='*50}")
            print(f"Training on task: {task_id}")
            print(f"Difficulty: {self.task_difficulty[task_id]:.3f}")
            print(f"{'='*50}")

            # Train on this task
            self.train_on_task(task_data, task_id)

    def train_on_task(self, task_data, task_id):
        """Task-specific training loop; supply one of the trainers above."""
        raise NotImplementedError
```

	---

	## Model Versioning

	### Semantic Versioning for Models

	```python
	from datetime import datetime
	from pathlib import Path  # used by ModelRegistry
	import json
	import hashlib

	class ModelVersion:
	    """
	    Semantic versioning for ML models.

	    Format: MAJOR.MINOR.PATCH

	    - MAJOR: Breaking changes (architecture change, incompatible API)
	    - MINOR: New features, improved performance (backward compatible)
	    - PATCH: Bug fixes, minor improvements (fully backward compatible)
	    """

	    def __init__(self, major=0, minor=0, patch=0, metadata=None):
	        self.major = major
	        self.minor = minor
	        self.patch = patch
	        self.metadata = metadata or {}
	        self.created_at = datetime.now().isoformat()

	    def bump_major(self):
	        """Increment major version (breaking change)."""
	        self.major += 1
	        self.minor = 0
	        self.patch = 0

	    def bump_minor(self):
	        """Increment minor version (new feature)."""
	        self.minor += 1
	        self.patch = 0

	    def bump_patch(self):
	        """Increment patch version (bug fix)."""
	        self.patch += 1

	    def __str__(self):
	        version_str = f"{self.major}.{self.minor}.{self.patch}"

	        if self.metadata:
	            # SemVer allows a single '+' followed by dot-separated identifiers
	            metadata_str = '.'.join(f"{k}-{v}" for k, v in self.metadata.items())
	            version_str += f"+{metadata_str}"

	        return version_str

	    def to_dict(self):
	        return {
	            'version': str(self),
	            'major': self.major,
	            'minor': self.minor,
	            'patch': self.patch,
	            'metadata': self.metadata,
	            'created_at': self.created_at
	        }

	    @classmethod
	    def from_string(cls, version_str):
	        """Parse version string to ModelVersion object."""
	        # Simple parsing: build metadata after '+' is discarded
	        parts = version_str.split('+')[0].split('.')
	        return cls(
	            major=int(parts[0]),
	            minor=int(parts[1]) if len(parts) > 1 else 0,
	            patch=int(parts[2]) if len(parts) > 2 else 0
	        )


	class ModelRegistry:
	    """
	    Centralized registry for model versions and artifacts.
	    """

	    def __init__(self, registry_path='./model_registry'):
	        self.registry_path = Path(registry_path)
	        self.registry_path.mkdir(parents=True, exist_ok=True)
	        self.models = {}  # {model_name: {version: metadata}}
	        self.load_registry()

	    @staticmethod
	    def _version_key(version_str):
	        """Numeric (major, minor, patch) sort key, so 1.10.0 ranks above 1.9.0."""
	        parsed = ModelVersion.from_string(version_str)
	        return (parsed.major, parsed.minor, parsed.patch)

	    def register_model(self, model_name, version, model_path, metrics, metadata=None):
	        """Register a new model version."""
	        if model_name not in self.models:
	            self.models[model_name] = {}

	        # Calculate model hash for integrity
	        model_hash = self.calculate_file_hash(model_path)

	        # Create metadata
	        model_metadata = {
	            'version': str(version),
	            'model_path': str(model_path),
	            'model_hash': model_hash,
	            'metrics': metrics,
	            'metadata': metadata or {},
	            'registered_at': datetime.now().isoformat(),
	            'status': 'active'  # active, deprecated, archived
	        }

	        self.models[model_name][str(version)] = model_metadata

	        # Save registry
	        self.save_registry()

	        print(f"Registered {model_name} v{version}")
	        print(f"  Path: {model_path}")
	        print(f"  Hash: {model_hash[:16]}...")
	        print(f"  Metrics: {metrics}")

	        return model_metadata

	    def get_latest_version(self, model_name):
	        """Get the latest active version of a model."""
	        if model_name not in self.models:
	            return None

	        active_versions = [
	            v for v, m in self.models[model_name].items()
	            if m['status'] == 'active'
	        ]

	        if not active_versions:
	            return None

	        # Compare numerically; ModelVersion objects define no ordering
	        return max(active_versions, key=self._version_key)

	    def deprecate_version(self, model_name, version_str):
	        """Mark a model version as deprecated."""
	        if model_name in self.models and version_str in self.models[model_name]:
	            self.models[model_name][version_str]['status'] = 'deprecated'
	            self.models[model_name][version_str]['deprecated_at'] = datetime.now().isoformat()
	            self.save_registry()
	            print(f"Deprecated {model_name} v{version_str}")

	    def calculate_file_hash(self, file_path):
	        """Calculate SHA256 hash of model file."""
	        sha256_hash = hashlib.sha256()
	        with open(file_path, "rb") as f:
	            # Read in chunks so large model files are never fully in memory
	            for byte_block in iter(lambda: f.read(4096), b""):
	                sha256_hash.update(byte_block)
	        return sha256_hash.hexdigest()

	    def save_registry(self):
	        """Save registry to disk."""
	        registry_file = self.registry_path / 'registry.json'
	        with open(registry_file, 'w') as f:
	            json.dump(self.models, f, indent=2)

	    def load_registry(self):
	        """Load registry from disk."""
	        registry_file = self.registry_path / 'registry.json'
	        if registry_file.exists():
	            with open(registry_file, 'r') as f:
	                self.models = json.load(f)

	    def list_models(self):
	        """List all registered models and versions."""
	        print("Model Registry")
	        print("=" * 70)

	        for model_name, versions in self.models.items():
	            print(f"\n{model_name}:")
	            ordered = sorted(versions.items(), key=lambda item: self._version_key(item[0]))
	            for version_str, metadata in ordered:
	                status = metadata['status']
	                metrics = metadata.get('metrics', {})
	                accuracy = metrics.get('accuracy', 'N/A')

	                print(f"  v{version_str:15} [{status:10}] Acc: {accuracy}")

	        print("=" * 70)
	```
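Version strings need numeric rather than string comparison when picking the latest release. The sketch below illustrates the pitfall with a hypothetical `version_key` helper (not part of NTF) that parses `MAJOR.MINOR.PATCH` into a tuple of integers.

```python
def version_key(version_str):
    """Hypothetical helper: turn 'MAJOR.MINOR.PATCH[+meta]' into a sortable tuple."""
    core = version_str.split("+")[0]          # drop build metadata
    parts = [int(p) for p in core.split(".")]
    parts += [0] * (3 - len(parts))           # pad missing MINOR/PATCH with 0
    return tuple(parts[:3])


versions = ["1.9.0", "1.10.0", "1.2.3+data-v2"]

lexicographic_latest = max(versions)               # plain string comparison
numeric_latest = max(versions, key=version_key)    # numeric comparison

print(lexicographic_latest)
print(numeric_latest)
```

String comparison ranks `1.9.0` above `1.10.0` because `'9' > '1'`, while the tuple key orders them as intended.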

	---

	## Deployment Strategies

	### Canary Deployments

	```python
	import random
	from datetime import datetime


	class CanaryDeployment:
	    """
	    Gradually roll out new model version to subset of traffic.
	    """

	    def __init__(self, old_model, new_model, initial_percentage=5):
	        self.old_model = old_model
	        self.new_model = new_model
	        self.canary_percentage = initial_percentage
	        self.deployment_log = []

	    def route_request(self, request):
	        """Route request to old or new model based on canary percentage."""
	        if random.random() < self.canary_percentage / 100:
	            model = self.new_model
	            variant = 'canary'
	        else:
	            model = self.old_model
	            variant = 'stable'

	        prediction = model.predict(request)

	        # Log for monitoring
	        self.deployment_log.append({
	            'timestamp': datetime.now().isoformat(),
	            'variant': variant,
	            'request_id': request.get('id'),
	            'prediction': prediction
	        })

	        return prediction

	    def increase_canary(self, increment=10):
	        """Increase canary traffic percentage."""
	        self.canary_percentage = min(100, self.canary_percentage + increment)
	        print(f"Canary traffic increased to {self.canary_percentage}%")

	    def rollback(self):
	        """Rollback to 100% old model."""
	        self.canary_percentage = 0
	        print("Rolled back to stable model")

	    def analyze_canary_performance(self, ground_truth):
	        """
	        Compare performance of canary vs stable.

	        Assumes ground_truth maps request IDs to true labels; requests
	        without a label yet are skipped.
	        """
	        canary_correct = 0
	        canary_total = 0
	        stable_correct = 0
	        stable_total = 0

	        for log_entry in self.deployment_log:
	            request_id = log_entry['request_id']
	            if request_id not in ground_truth:
	                continue
	            is_correct = log_entry['prediction'] == ground_truth[request_id]

	            if log_entry['variant'] == 'canary':
	                canary_correct += is_correct
	                canary_total += 1
	            else:
	                stable_correct += is_correct
	                stable_total += 1

	        canary_acc = canary_correct / canary_total if canary_total > 0 else 0
	        stable_acc = stable_correct / stable_total if stable_total > 0 else 0

	        print(f"Canary Accuracy: {canary_acc:.4f} (n={canary_total})")
	        print(f"Stable Accuracy: {stable_acc:.4f} (n={stable_total})")

	        # Without labeled traffic on both variants there is nothing to compare
	        if canary_total == 0 or stable_total == 0:
	            print("⚠️ Not enough labeled traffic - continue monitoring")
	            return 'monitor'

	        improvement = canary_acc - stable_acc

	        if improvement > 0.02:  # 2 percentage point improvement threshold
	            print("✅ Canary performing better - consider increasing traffic")
	            return 'promote'
	        elif improvement < -0.02:
	            print("❌ Canary performing worse - consider rollback")
	            return 'rollback'
	        else:
	            print("⚠️ Similar performance - continue monitoring")
	            return 'monitor'
	```
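A fixed 2-point accuracy gap can arise by chance when the canary has served only a few hundred requests. One way to guard against this, sketched below under the assumption that each request is scored as simply correct or incorrect, is a two-proportion z-test. The function name and the example counts are illustrative, not part of NTF.

```python
import math


def two_proportion_z_test(correct_a, total_a, correct_b, total_b):
    """Return (z, two-sided p-value) for the accuracy difference b - a."""
    if total_a == 0 or total_b == 0:
        raise ValueError("both variants need at least one labeled request")

    p_a = correct_a / total_a
    p_b = correct_b / total_b

    # Pooled accuracy under the null hypothesis that both variants are equal
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0  # identical, degenerate outcomes: no evidence of a gap

    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Illustrative counts: stable 90% on 10,000 requests, canary 92% on 500
z, p_value = two_proportion_z_test(9000, 10000, 460, 500)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these made-up counts the 2-point gap is not significant at the usual 0.05 level, which suggests collecting more canary traffic before promoting.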

	### Blue-Green Deployment

	```python
	class BlueGreenDeployment:
	    """
	    Maintain two identical production environments.

	    - Blue: Currently serving all traffic
	    - Green: Idle environment with new model

	    Switch traffic instantly when ready.
	    """

	    def __init__(self, initial_model=None):
	        self.active_environment = 'blue'
	        self.environments = {
	            # Blue starts out as the environment serving traffic
	            'blue': {'model': initial_model, 'status': 'active'},
	            'green': {'model': None, 'status': 'inactive'}
	        }

	    def deploy_to_inactive(self, new_model):
	        """Deploy new model to inactive environment."""
	        inactive_env = 'green' if self.active_environment == 'blue' else 'blue'

	        self.environments[inactive_env]['model'] = new_model
	        self.environments[inactive_env]['status'] = 'ready'

	        print(f"Deployed new model to {inactive_env} environment")

	    def switch_traffic(self):
	        """Switch all traffic to the other environment."""
	        target = 'green' if self.active_environment == 'blue' else 'blue'

	        # Never route traffic to an environment that has no model
	        if self.environments[target]['model'] is None:
	            raise RuntimeError(f"No model deployed to {target}; cannot switch traffic")

	        old_active = self.active_environment
	        self.active_environment = target

	        self.environments[old_active]['status'] = 'inactive'
	        self.environments[self.active_environment]['status'] = 'active'

	        print(f"Traffic switched from {old_active} to {self.active_environment}")

	    def predict(self, request):
	        """Route request to active environment."""
	        model = self.environments[self.active_environment]['model']
	        return model.predict(request)

	    def rollback(self):
	        """Quick rollback by switching environments."""
	        self.switch_traffic()
	        print("Rolled back to previous environment")
	```

	### Shadow Mode Deployment

	```python
	from datetime import datetime


	class ShadowModeDeployment:
	    """
	    Run new model in shadow mode alongside production.

	    New model receives all requests but doesn't serve predictions.
	    Used for validation without risk.
	    """

	    def __init__(self, production_model, shadow_model):
	        self.production_model = production_model
	        self.shadow_model = shadow_model
	        self.shadow_predictions = []

	    def predict(self, request):
	        """Serve from production, record shadow predictions."""
	        # Production prediction (served to user)
	        production_pred = self.production_model.predict(request)

	        # Shadow prediction (recorded only); a shadow failure must not
	        # affect the response served to the user
	        try:
	            shadow_pred = self.shadow_model.predict(request)
	        except Exception as exc:
	            print(f"Shadow model failed: {exc!r}")
	            shadow_pred = None

	        # Store for analysis
	        self.shadow_predictions.append({
	            'request': request,
	            'production': production_pred,
	            'shadow': shadow_pred,
	            'timestamp': datetime.now().isoformat()
	        })

	        return production_pred

	    def analyze_discrepancies(self, ground_truth=None):
	        """
	        Analyze differences between production and shadow.

	        If given, ground_truth must be aligned with the recorded requests.
	        """
	        discrepancies = 0
	        total = len(self.shadow_predictions)

	        for entry in self.shadow_predictions:
	            if entry['production'] != entry['shadow']:
	                discrepancies += 1

	        discrepancy_rate = discrepancies / total if total > 0 else 0

	        print("Shadow Mode Analysis:")
	        print(f"  Total requests: {total}")
	        print(f"  Discrepancies: {discrepancies} ({discrepancy_rate:.2%})")

	        if ground_truth is not None and total > 0:
	            # Evaluate which model performed better
	            prod_correct = sum(
	                1 for e, gt in zip(self.shadow_predictions, ground_truth)
	                if e['production'] == gt
	            )
	            shadow_correct = sum(
	                1 for e, gt in zip(self.shadow_predictions, ground_truth)
	                if e['shadow'] == gt
	            )

	            print(f"  Production accuracy: {prod_correct/total:.4f}")
	            print(f"  Shadow accuracy: {shadow_correct/total:.4f}")

	        return discrepancy_rate
	```

	---

	## Production Monitoring

	### Real-Time Performance Monitoring

	```python
	import time
	from collections import defaultdict, deque
	from datetime import datetime

	import numpy as np


	class ProductionMonitor:
	    """
	    Monitor model performance in production.
	    """

	    def __init__(self, window_size=1000, alert_cooldown_s=60):
	        self.window_size = window_size

	        # Metrics windows
	        self.latency_window = deque(maxlen=window_size)
	        self.throughput_window = deque(maxlen=window_size)
	        self.prediction_distribution = defaultdict(int)
	        self.confidence_window = deque(maxlen=window_size)

	        # Alerts
	        self.alerts = []
	        self.alert_cooldown_s = alert_cooldown_s  # min seconds between repeats of one alert type
	        self.last_alert_time = {}
	        self.alert_thresholds = {
	            'latency_p99': 1000,  # ms
	            'throughput_min': 10,  # requests/sec
	            'confidence_low': 0.5
	        }

	    def record_prediction(self, prediction, confidence, latency_ms):
	        """Record a prediction event."""
	        timestamp = time.time()

	        # Record metrics
	        self.latency_window.append((timestamp, latency_ms))
	        self.prediction_distribution[prediction] += 1
	        self.confidence_window.append((timestamp, confidence))
	        self.throughput_window.append(timestamp)

	        # Check for alerts
	        self.check_alerts()

	    def check_alerts(self):
	        """Check if any metrics exceed thresholds."""
	        current_time = time.time()

	        # P99 Latency
	        latencies = [l for _, l in self.latency_window]
	        if latencies:
	            p99_latency = np.percentile(latencies, 99)
	            if p99_latency > self.alert_thresholds['latency_p99']:
	                self.create_alert('HIGH_LATENCY', f"P99 latency: {p99_latency:.0f}ms")

	        # Throughput (requests in the last second, capped at window_size)
	        recent_throughput = sum(
	            1 for t in self.throughput_window
	            if current_time - t < 1.0
	        )
	        if recent_throughput < self.alert_thresholds['throughput_min']:
	            self.create_alert('LOW_THROUGHPUT', f"Throughput: {recent_throughput} req/s")

	        # Low confidence
	        confidences = [c for _, c in self.confidence_window]
	        if confidences:
	            low_conf_ratio = np.mean([c < self.alert_thresholds['confidence_low'] for c in confidences])
	            if low_conf_ratio > 0.2:  # More than 20% low confidence
	                self.create_alert('LOW_CONFIDENCE', f"Low confidence ratio: {low_conf_ratio:.2%}")

	    def create_alert(self, alert_type, message):
	        """Create an alert, suppressing repeats of one type within the cooldown."""
	        now = time.time()
	        if now - self.last_alert_time.get(alert_type, float('-inf')) < self.alert_cooldown_s:
	            return
	        self.last_alert_time[alert_type] = now

	        alert = {
	            'type': alert_type,
	            'message': message,
	            'timestamp': datetime.now().isoformat(),
	            'epoch': now  # numeric time, used to count recent alerts
	        }
	        self.alerts.append(alert)
	        print(f"🚨 ALERT [{alert_type}]: {message}")

	    def get_dashboard_metrics(self):
	        """Get current metrics for dashboard."""
	        current_time = time.time()

	        # Latency stats
	        latencies = [l for _, l in self.latency_window]
	        latency_stats = {
	            'mean': np.mean(latencies) if latencies else 0,
	            'p50': np.percentile(latencies, 50) if latencies else 0,
	            'p95': np.percentile(latencies, 95) if latencies else 0,
	            'p99': np.percentile(latencies, 99) if latencies else 0
	        }

	        # Throughput
	        throughput = sum(1 for t in self.throughput_window if current_time - t < 1.0)

	        # Confidence stats
	        confidences = [c for _, c in self.confidence_window]
	        confidence_stats = {
	            'mean': np.mean(confidences) if confidences else 0,
	            'std': np.std(confidences) if confidences else 0
	        }

	        # Prediction distribution
	        total_preds = sum(self.prediction_distribution.values())
	        pred_distribution = {
	            k: v / total_preds if total_preds > 0 else 0
	            for k, v in self.prediction_distribution.items()
	        }

	        return {
	            'latency': latency_stats,
	            'throughput': throughput,
	            'confidence': confidence_stats,
	            'prediction_distribution': pred_distribution,
	            # Alerts raised in the last hour
	            'active_alerts': len([a for a in self.alerts if a['epoch'] > current_time - 3600])
	        }
	```

	### Drift Detection in Production

	```python
	from collections import deque
	from datetime import datetime

	import numpy as np
	from scipy import stats


	class ProductionDriftDetector:
	    """
	    Detect data drift and concept drift in production.
	    """

	    def __init__(self, reference_data, model, detection_method='ks_test'):
	        self.reference_data = reference_data
	        self.model = model
	        self.detection_method = detection_method

	        # Reference statistics
	        self.reference_stats = self.compute_reference_statistics()

	        # Production data window
	        self.production_window = deque(maxlen=1000)

	        # Drift alerts
	        self.drift_alerts = []

	    def compute_reference_statistics(self):
	        """Compute statistics from reference (training) data."""
	        ref_stats = {}

	        # Feature statistics (features are cached for the per-feature drift tests)
	        self.reference_features = self.extract_features(self.reference_data)
	        ref_stats['feature_means'] = np.mean(self.reference_features, axis=0)
	        ref_stats['feature_stds'] = np.std(self.reference_features, axis=0)

	        # Prediction distribution
	        reference_preds = self.model.predict(self.reference_data)
	        ref_stats['prediction_distribution'] = np.bincount(reference_preds) / len(reference_preds)

	        # Confidence distribution
	        reference_confs = self.model.get_confidences(self.reference_data)
	        ref_stats['confidence_mean'] = np.mean(reference_confs)
	        ref_stats['confidence_std'] = np.std(reference_confs)

	        return ref_stats

	    def extract_features(self, data):
	        """Extract features from data for drift detection."""
	        # Implementation depends on data type (raw features, embeddings, etc.).
	        # Default assumption: data is already a 2-D numeric array of shape
	        # (n_samples, n_features); override this method for other data types.
	        return np.asarray(data, dtype=float)

	    def detect_feature_drift(self, new_data):
	        """Detect drift in input features."""
	        new_features = self.extract_features(new_data)

	        if self.detection_method == 'ks_test':
	            # Kolmogorov-Smirnov test for each feature
	            drift_scores = []
	            p_values = []

	            for i in range(new_features.shape[1]):
	                stat, p_val = stats.ks_2samp(
	                    self.reference_features[:, i],
	                    new_features[:, i]
	                )
	                drift_scores.append(stat)
	                p_values.append(p_val)

	            # Fraction of features whose distribution shifted (p < 0.05)
	            significant_drift = np.mean([p < 0.05 for p in p_values])

	            return {
	                'drift_detected': significant_drift > 0.3,  # 30% features drifted
	                'drift_scores': drift_scores,
	                'p_values': p_values,
	                'fraction_drifted': significant_drift
	            }

	        if self.detection_method == 'population_stability_index':
	            # PSI for categorical features
	            raise NotImplementedError("PSI-based drift detection is not implemented")

	        raise ValueError(f"Unknown detection method: {self.detection_method}")

	    def detect_prediction_drift(self, new_predictions):
	        """Detect drift in prediction distribution."""
	        new_distribution = np.bincount(new_predictions) / len(new_predictions)

	        # Ensure same length
	        max_len = max(len(self.reference_stats['prediction_distribution']),
	                      len(new_distribution))

	        ref_dist = np.zeros(max_len)
	        new_dist = np.zeros(max_len)

	        ref_dist[:len(self.reference_stats['prediction_distribution'])] = \
	            self.reference_stats['prediction_distribution']
	        new_dist[:len(new_distribution)] = new_distribution

	        # Jensen-Shannon divergence: mean KL divergence of each distribution
	        # to their mixture (symmetric, bounded by ln 2)
	        epsilon = 1e-10
	        mixture = 0.5 * (ref_dist + new_dist)
	        kl_ref = np.sum(ref_dist * np.log((ref_dist + epsilon) / (mixture + epsilon)))
	        kl_new = np.sum(new_dist * np.log((new_dist + epsilon) / (mixture + epsilon)))
	        js_div = 0.5 * kl_ref + 0.5 * kl_new

	        drift_detected = js_div > 0.1  # Threshold

	        return {
	            'drift_detected': drift_detected,
	            'js_divergence': js_div,
	            'reference_distribution': ref_dist,
	            'new_distribution': new_dist
	        }

	    def monitor(self, new_data, new_predictions):
	        """Run all drift detection methods."""
	        results = {}

	        # Feature drift
	        results['feature_drift'] = self.detect_feature_drift(new_data)

	        # Prediction drift
	        results['prediction_drift'] = self.detect_prediction_drift(new_predictions)

	        # Overall drift decision
	        overall_drift = (
	            results['feature_drift']['drift_detected'] or
	            results['prediction_drift']['drift_detected']
	        )

	        if overall_drift:
	            self.create_drift_alert(results)

	        return {
	            'drift_detected': overall_drift,
	            'details': results
	        }

	    def create_drift_alert(self, drift_results):
	        """Create drift alert."""
	        alert = {
	            'timestamp': datetime.now().isoformat(),
	            'feature_drift': drift_results['feature_drift'],
	            'prediction_drift': drift_results['prediction_drift']
	        }
	        self.drift_alerts.append(alert)

	        print("🚨 DRIFT DETECTED!")
	        print(f"  Feature drift: {drift_results['feature_drift']['fraction_drifted']:.2%}")
	        print(f"  Prediction drift (JS): {drift_results['prediction_drift']['js_divergence']:.4f}")
	```
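The `population_stability_index` option is left unimplemented in the class above. As a rough sketch of what it could compute for a single numeric feature, the function below bins both samples on the reference quantiles and sums `(cur - ref) * ln(cur / ref)` over the bins. The function name, bin count, and `eps` floor are assumptions for illustration; the commonly quoted cut-offs (below 0.1 stable, above 0.25 significant shift) are rules of thumb, not guarantees.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two 1-D numeric samples, binned on reference quantiles."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges from reference quantiles; the outer bins are open-ended
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]

    # searchsorted maps each value to a bin index in [0, n_bins - 1]
    ref_counts = np.bincount(np.searchsorted(edges, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(edges, current), minlength=n_bins)

    # Convert to proportions, flooring at eps so the log stays finite
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_frac = np.clip(cur_counts / cur_counts.sum(), eps, None)

    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)
same_dist = rng.normal(0.0, 1.0, size=5000)
shifted = rng.normal(1.0, 1.0, size=5000)

psi_same = population_stability_index(reference, same_dist)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI (same distribution): {psi_same:.4f}")
print(f"PSI (mean shifted by 1): {psi_shifted:.4f}")
```

Binning on reference quantiles keeps every reference bin populated, which avoids the empty-bin blow-ups that fixed-width bins can produce on skewed features.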

	---

	## Model Retirement and Archival

	### Model Deprecation Process

	```python
	import json
	import shutil
	import textwrap
	from datetime import datetime, timedelta
	from pathlib import Path


	class ModelDeprecationManager:
	    """
	    Manage the deprecation and retirement lifecycle of models.
	    """

	    def __init__(self, model_registry):
	        self.registry = model_registry
	        self.deprecation_schedule = {}

	    def initiate_deprecation(self, model_name, version, reason, timeline_days=90):
	        """
	        Begin the deprecation process for a model version.
	        """
	        deprecation_date = datetime.now()
	        retirement_date = deprecation_date + timedelta(days=timeline_days)

	        self.deprecation_schedule[f"{model_name}:{version}"] = {
	            'model_name': model_name,
	            'version': version,
	            'reason': reason,
	            'deprecation_date': deprecation_date.isoformat(),
	            'retirement_date': retirement_date.isoformat(),
	            'status': 'deprecated',
	            'replacement': None,
	            'migration_guide': None
	        }

	        # Update registry
	        self.registry.deprecate_version(model_name, version)

	        print(f"Initiated deprecation for {model_name} v{version}")
	        print(f"  Reason: {reason}")
	        print(f"  Deprecation date: {deprecation_date.strftime('%Y-%m-%d')}")
	        print(f"  Retirement date: {retirement_date.strftime('%Y-%m-%d')}")

	        return self.deprecation_schedule[f"{model_name}:{version}"]

	    def set_replacement(self, model_name, old_version, new_model_name, new_version):
	        """Specify replacement model for deprecated version."""
	        key = f"{model_name}:{old_version}"

	        if key in self.deprecation_schedule:
	            self.deprecation_schedule[key]['replacement'] = {
	                'model_name': new_model_name,
	                'version': new_version
	            }

	            # Generate migration guide
	            self.generate_migration_guide(model_name, old_version, new_model_name, new_version)

	    def generate_migration_guide(self, old_model, old_version, new_model, new_version):
	        """Generate migration guide for users."""
	        info = self.deprecation_schedule[f"{old_model}:{old_version}"]

	        guide = textwrap.dedent(f"""\
	            # Migration Guide: {old_model} v{old_version} → {new_model} v{new_version}

	            ## Timeline
	            - Deprecation Date: {info['deprecation_date']}
	            - Retirement Date: {info['retirement_date']}

	            ## Breaking Changes
	            [List breaking changes here]

	            ## API Differences
	            [Document API changes]

	            ## Performance Improvements
	            [Document improvements]

	            ## Migration Steps
	            1. Update model reference in configuration
	            2. Test with new model in staging environment
	            3. Validate outputs match expectations
	            4. Deploy to production with canary rollout
	            5. Monitor for issues

	            ## Support
	            Contact: ml-platform@company.com
	            """)

	        info['migration_guide'] = guide

	        # Save guide, creating the output directory on first use
	        guide_dir = Path('migration_guides')
	        guide_dir.mkdir(parents=True, exist_ok=True)
	        guide_path = guide_dir / f"{old_model}_{old_version}_to_{new_model}_{new_version}.md"
	        with open(guide_path, 'w', encoding='utf-8') as f:
	            f.write(guide)

	    def retire_model(self, model_name, version):
	        """
	        Fully retire a model version (after deprecation period).
	        """
	        key = f"{model_name}:{version}"

	        if key not in self.deprecation_schedule:
	            print(f"No deprecation record found for {model_name}:{version}")
	            return False

	        deprecation_info = self.deprecation_schedule[key]
	        retirement_date = datetime.fromisoformat(deprecation_info['retirement_date'])

	        if datetime.now() < retirement_date:
	            print(f"Cannot retire before {retirement_date.strftime('%Y-%m-%d')}")
	            return False

	        # Archive model
	        self.archive_model(model_name, version)

	        # Update status
	        deprecation_info['status'] = 'retired'
	        deprecation_info['retired_at'] = datetime.now().isoformat()

	        print(f"Retired {model_name} v{version}")

	        return True

	    def archive_model(self, model_name, version):
	        """Move model to cold storage."""
	        # Get model path from registry
	        model_metadata = self.registry.models[model_name][version]
	        model_path = Path(model_metadata['model_path'])

	        # Create archive directory
	        archive_dir = Path('./model_archives') / model_name / version
	        archive_dir.mkdir(parents=True, exist_ok=True)

	        # Move model files
	        archived_path = archive_dir / model_path.name
	        shutil.move(str(model_path), str(archived_path))

	        # Point the registry at the archived copy so its path stays valid
	        model_metadata['model_path'] = str(archived_path)
	        model_metadata['status'] = 'archived'
	        self.registry.save_registry()

	        # Save metadata
	        metadata_path = archive_dir / 'metadata.json'
	        with open(metadata_path, 'w') as f:
	            json.dump(model_metadata, f, indent=2)

	        # Compress archive (written next to archive_dir as <version>.tar.gz)
	        shutil.make_archive(str(archive_dir), 'gztar', archive_dir)

	        print(f"Archived {model_name} v{version} to {archive_dir}")

	    def get_deprecation_status(self, model_name, version=None):
	        """Get deprecation status for a model."""
	        if version:
	            key = f"{model_name}:{version}"
	            return self.deprecation_schedule.get(key, None)
	        else:
	            # Return all versions
	            return {
	                k: v for k, v in self.deprecation_schedule.items()
	                if v['model_name'] == model_name
	            }
	```

	### Model Lineage Tracking

	```python
	import platform
	import sys
	from datetime import datetime
	from importlib import metadata as importlib_metadata


	class ModelLineageTracker:
	    """
	    Track complete lineage of models from training to retirement.
	    """

	    def __init__(self):
	        self.lineage_graph = {}  # {model_id: lineage_info}

	    def record_training_run(self, model_id, training_config, data_version,
	                            code_version, hyperparameters):
	        """Record details of a training run."""
	        self.lineage_graph[model_id] = {
	            'model_id': model_id,
	            'created_at': datetime.now().isoformat(),
	            'training': {
	                'config': training_config,
	                'data_version': data_version,
	                'code_version': code_version,
	                'hyperparameters': hyperparameters,
	                'environment': self.capture_environment()
	            },
	            'parent_models': [],  # For fine-tuned models
	            'child_models': [],  # Models derived from this one
	            'evaluation_results': {},
	            'deployment_history': [],
	            'retirement_info': None
	        }

	    def record_fine_tuning(self, child_model_id, parent_model_id,
	                           fine_tuning_data, fine_tuning_config):
	        """Record fine-tuning relationship."""
	        if parent_model_id in self.lineage_graph:
	            # Add child to parent
	            self.lineage_graph[parent_model_id]['child_models'].append(child_model_id)

	        # Create child record
	        self.lineage_graph[child_model_id] = {
	            'model_id': child_model_id,
	            'created_at': datetime.now().isoformat(),
	            'parent_models': [parent_model_id],
	            'fine_tuning': {
	                'data': fine_tuning_data,
	                'config': fine_tuning_config
	            },
	            'child_models': [],
	            'evaluation_results': {},
	            'deployment_history': [],
	            'retirement_info': None
	        }

	    def record_evaluation(self, model_id, dataset_name, metrics):
	        """Record evaluation results."""
	        if model_id in self.lineage_graph:
	            self.lineage_graph[model_id]['evaluation_results'][dataset_name] = {
	                'metrics': metrics,
	                'recorded_at': datetime.now().isoformat()
	            }

	    def record_deployment(self, model_id, environment, deployment_config):
	        """Record deployment event."""
	        if model_id in self.lineage_graph:
	            self.lineage_graph[model_id]['deployment_history'].append({
	                'environment': environment,
	                'config': deployment_config,
	                'deployed_at': datetime.now().isoformat()
	            })

	    def record_retirement(self, model_id, reason):
	        """Record retirement so the lineage covers the full lifecycle."""
	        if model_id in self.lineage_graph:
	            self.lineage_graph[model_id]['retirement_info'] = {
	                'reason': reason,
	                'retired_at': datetime.now().isoformat()
	            }

	    def capture_environment(self):
	        """Capture training environment details."""
	        return {
	            'python_version': sys.version,
	            'platform': platform.platform(),
	            'packages': self.get_installed_packages()
	        }

	    def get_installed_packages(self):
	        """Get list of installed packages and versions."""
	        # importlib.metadata replaces the deprecated pkg_resources API
	        return {
	            dist.metadata['Name']: dist.version
	            for dist in importlib_metadata.distributions()
	        }

	    def get_model_lineage(self, model_id):
	        """Get complete lineage for a model."""
	        if model_id not in self.lineage_graph:
	            return None

	        # Shallow copy, so the added parent_lineages key stays out of the graph
	        lineage = self.lineage_graph[model_id].copy()

	        # Recursively get parent lineages
	        if lineage['parent_models']:
	            lineage['parent_lineages'] = [
	                self.get_model_lineage(parent_id)
	                for parent_id in lineage['parent_models']
	            ]

	        return lineage

	    def visualize_lineage(self, model_id):
	        """Visualize model lineage as a graph."""
	        try:
	            import graphviz
	        except ImportError:
	            print("Install graphviz: pip install graphviz")
	            return

	        dot = graphviz.Digraph(comment='Model Lineage')
	        visited = set()

	        def add_node(mid):
	            # Skip unknown models and nodes already drawn (shared ancestors)
	            if mid not in self.lineage_graph or mid in visited:
	                return
	            visited.add(mid)

	            info = self.lineage_graph[mid]
	            label = f"{mid}\\n{info['created_at'][:10]}"

	            if info['retirement_info']:
	                dot.node(mid, label, style='filled', fillcolor='lightgray')
	            else:
	                dot.node(mid, label)

	            # Add edges to parents
	            for parent_id in info['parent_models']:
	                dot.edge(parent_id, mid)
	                add_node(parent_id)

	        add_node(model_id)

	        # Render graph
	        dot.render('model_lineage.gv', view=True)
	```

	---

	## Best Practices Checklist

	### Continual Learning Best Practices

	- [ ] Choose the Right Strategy: Select a replay-, regularization-, or architecture-based method based on your constraints
	- [ ] Monitor Forgetting: Track backward transfer and forgetting metrics
	- [ ] Balance Stability-Plasticity: Tune hyperparameters for optimal balance
	- [ ] Validate Frequently: Evaluate on all tasks after each new task
	- [ ] Document Task Boundaries: Clearly define when new tasks begin
	- [ ] Plan Capacity: Ensure model has enough capacity for expected tasks
	- [ ] Test Transfer: Measure forward transfer between tasks
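
The forgetting and backward-transfer metrics in the checklist above can be computed from an accuracy matrix in which entry `acc[i][j]` is the accuracy on task `j` after training on task `i`. The sketch below uses common definitions of the two metrics; the function names and the example numbers are made up for illustration.

```python
import numpy as np


def backward_transfer(acc):
    """Mean change in accuracy on earlier tasks after training on the last task.

    Negative values indicate forgetting.
    """
    acc = np.asarray(acc, dtype=float)
    n_tasks = acc.shape[0]
    final = acc[n_tasks - 1, :n_tasks - 1]        # accuracy at the very end
    just_learned = np.diag(acc)[:n_tasks - 1]     # accuracy right after learning
    return float(np.mean(final - just_learned))


def average_forgetting(acc):
    """Mean drop from each earlier task's best past accuracy to its final accuracy."""
    acc = np.asarray(acc, dtype=float)
    n_tasks = acc.shape[0]
    drops = []
    for j in range(n_tasks - 1):
        # Only rows j..n_tasks-2 count: the task must already have been learned
        best_before = acc[j:n_tasks - 1, j].max()
        drops.append(best_before - acc[n_tasks - 1, j])
    return float(np.mean(drops))


# Illustrative numbers for three sequential tasks
acc_matrix = [
    [0.90, 0.10, 0.05],
    [0.80, 0.88, 0.12],
    [0.70, 0.85, 0.91],
]

bwt = backward_transfer(acc_matrix)
forgetting = average_forgetting(acc_matrix)
print(f"Backward transfer: {bwt:+.3f}")
print(f"Average forgetting: {forgetting:.3f}")
```

Filling in one row of this matrix after every new task is what the "Validate Frequently" item amounts to in practice.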

	### Model Lifecycle Best Practices

	- [ ] Version Everything: Models, data, code, configurations
	- [ ] Automate Deployment: Use CI/CD pipelines for model deployment
	- [ ] Monitor Continuously: Track performance, latency, drift in production
	- [ ] Plan Deprecation: Have clear retirement criteria and processes
	- [ ] Maintain Lineage: Track complete history from training to retirement
	- [ ] Document Decisions: Record why models were created, changed, retired
	- [ ] Security: Control access to model artifacts and endpoints

	---

	## Next Steps

	In the next tutorial, we'll cover:
	- Complete Production Pipeline: End-to-end example from training to serving
	- Scaling Strategies: Distributed training and inference at scale
	- Cost Optimization: Reducing training and inference costs
	- Team Collaboration: MLOps workflows for teams
	- Case Studies: Real-world examples and lessons learned