Temporal Difference Learning Implementation
A comprehensive implementation of the TD(0) learning algorithm with detailed logging, visualization, and analysis capabilities.
🎯 Overview
This project demonstrates Temporal Difference Learning, a fundamental reinforcement learning algorithm that learns state values by bootstrapping from current estimates rather than waiting for complete episodes.
🚀 Features
- Complete TD(0) Implementation - Core temporal difference learning algorithm
- Detailed Logging - Every TD update tracked and logged
- Real-time Visualization - Value function evolution and convergence plots
- Comprehensive Metrics - Training progress, TD errors, and convergence analysis
- Auto-save Results - JSON export of all training data and parameters
- Cross-platform Compatible - Works on Apple Silicon, Intel, and Google Colab
📦 Installation
# Install required packages
pip install numpy matplotlib
# Optional: Enhanced visualization
pip install seaborn pandas plotly
# Clone and run
git clone [your-repo-url]
cd td-learning
python td_learning.py
🎮 Quick Start
from td_learning import TDLearningEnvironment, TDLearningAgent
# Create environment and agent
env = TDLearningEnvironment(num_states=5)
agent = TDLearningAgent(num_states=5, alpha=0.1, gamma=0.9)
# Train the agent
agent.train(env, num_episodes=100)
# Visualize results
agent.visualize_training()
📊 What You'll See
Value Function Learning
The agent learns that states closer to the terminal reward have higher values:
- State 0: 2.42 (starting point)
- State 1: 4.85 (getting better)
- State 2: 6.91 (good position)
- State 3: 8.67 (almost at goal)
- State 4: 0.00 (terminal state)
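These values come straight from the agent's value table. Assuming the trained agent from the Quick Start above, you can inspect them directly (the table is stored in agent.V, the same attribute used in examples.py below):

# Inspect the learned value table after training
print(agent.V)  # e.g. [2.42, 4.85, 6.91, 8.67, 0.0]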
Training Progression
- Episodes 1-20: Large TD errors (2.0+) as the agent explores
- Episodes 20-60: Moderate errors (1.5-2.0) as values stabilize
- Episodes 60+: Small errors (<1.5) showing convergence
⚙️ Configuration
Environment Parameters
- num_states: Number of states in the environment (default: 5)
- terminal_state: Goal state that provides the reward (default: 4)
Agent Parameters
- alpha: Learning rate (0.1) - how fast to update values
- gamma: Discount factor (0.9) - importance of future rewards
- num_episodes: Training episodes (100)
Reward Structure
- +10: Reaching the terminal state
- -1: Wrong action
- 0: Correct move toward the goal
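The actual environment lives in td_learning.py; as a rough sketch of how this reward structure might be wired up (assuming actions -1/+1 move left/right along the chain and self.terminal_state sits at the right end), the step method could look like this:

def step(self, state, action):
    # Move left (-1) or right (+1), clamped to the valid state range
    next_state = max(0, min(self.num_states - 1, state + action))
    if next_state == self.terminal_state:
        return next_state, 10.0, True    # +10: reached the terminal state
    if action == -1:
        return next_state, -1.0, False   # -1: wrong action, moving away from the goal
    return next_state, 0.0, False        # 0: correct move toward the goal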
📁 Output Files
Automatic Saves
- td_learning_YYYYMMDD_HHMMSS.json - Complete training data
- td_learning_plots_YYYYMMDD_HHMMSS.png - Visualization plots
JSON Structure
{
"parameters": {
"alpha": 0.1,
"gamma": 0.9,
"num_states": 5
},
"final_values": [2.42, 4.85, 6.91, 8.67, 0.0],
"training_metrics": {
"episodes": [...],
"total_rewards": [...],
"avg_td_error": [...]
}
}
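A quick way to reload a saved run for offline analysis (a sketch; it assumes at least one auto-saved file matching the pattern above exists in the working directory):

import json
from glob import glob

# Pick the most recent auto-saved training run
latest = sorted(glob("td_learning_*.json"))[-1]
with open(latest) as f:
    run = json.load(f)

print(run["parameters"])    # {'alpha': 0.1, 'gamma': 0.9, 'num_states': 5}
print(run["final_values"])  # [2.42, 4.85, 6.91, 8.67, 0.0]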
🧠 Algorithm Details
TD(0) Update Rule
V(s) ← V(s) + α[r + γV(s') - V(s)]
Where:
- V(s): Current state value estimate
- α: Learning rate
- r: Immediate reward
- γ: Discount factor
- V(s'): Next state value estimate
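As a self-contained illustration (a sketch independent of the project's classes), the entire update fits in a few lines:

import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # TD error: how far the current estimate is from the bootstrapped target
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error  # nudge V(s) toward the target
    return td_error

V = np.zeros(5)
td0_update(V, s=3, r=10.0, s_next=4)  # V[3] moves from 0.0 to 1.0

With α=0.1 and the +10 terminal reward, a single visit lifts V(3) to 1.0, which matches the Episode 1 jump described in the blog post draft below.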
Key Concepts
- Bootstrapping: Using current estimates to improve future estimates
- Online Learning: Updates happen immediately after each experience
- Temporal Difference: Learning from the difference between predictions
🔬 Experiments
Hyperparameter Testing
# Test different learning rates
for alpha in [0.01, 0.1, 0.3, 0.5]:
    agent = TDLearningAgent(num_states=5, alpha=alpha, gamma=0.9)
    agent.train(env, num_episodes=100)
Environment Variations
# Test different environment sizes
for num_states in [3, 5, 10, 20]:
    env = TDLearningEnvironment(num_states=num_states)
    agent = TDLearningAgent(num_states=num_states)
    agent.train(env, num_episodes=200)
📚 Educational Use
Perfect for:
- RL Course Assignments - Clear, well-documented implementation
- Research Baseline - Solid foundation for TD learning experiments
- Concept Demonstration - Visual learning of value function convergence
- Algorithm Comparison - Benchmark against other RL methods
🐛 Troubleshooting
Common Issues
- Values not converging: Check learning rate (try α=0.1)
- Oscillating values: Learning rate too high (reduce α)
- Slow learning: Learning rate too low (increase α) or more episodes needed
- Import errors: Install required packages with pip
Performance Tips
- Faster convergence: Increase learning rate (α) but watch for instability
- Better exploration: Implement ε-greedy action selection (see the sketch below)
- Larger environments: Increase episode count proportionally
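One way to add that exploration (a sketch for this chain environment, assuming actions -1/+1 and a value table V indexed by state):

import random

def epsilon_greedy_action(V, state, num_states, epsilon=0.1):
    # With probability epsilon, explore with a random action
    actions = [-1, 1]
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise exploit: step toward the neighbour with the higher value estimate
    def neighbour_value(a):
        s_next = max(0, min(num_states - 1, state + a))
        return V[s_next]
    return max(actions, key=neighbour_value)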
📖 References
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.)
- Chapter 6: Temporal Difference Learning
- Online Book
🤝 Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Richard Sutton and Andrew Barto for foundational RL theory
- OpenAI Gym for environment design inspiration
- Matplotlib community for visualization tools
📝 Blog Post Draft
Understanding Temporal Difference Learning: Learning to Predict the Future
How AI agents learn to estimate value using incomplete information
The Problem: Learning Without Complete Information
Imagine you're exploring a new city and trying to figure out which neighborhoods are "good" to be in. Traditional approaches might require you to complete entire walking tours before updating your opinions. But what if you could learn immediately from each step?
That's exactly what Temporal Difference (TD) Learning does for AI agents.
What Makes TD Learning Special?
Unlike Monte Carlo methods that wait for complete episodes, TD learning updates its beliefs immediately after each experience. It's like updating your restaurant ratings after each meal, rather than waiting until you've tried every dish.
The Magic Formula
V(s) ← V(s) + α[r + γV(s') - V(s)]
This simple equation captures profound learning:
- V(s): "How good do I think this state is?"
- r + γV(s'): "What did I just learn about this state?"
- α: "How much should I trust this new information?"
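To make this concrete, here is one update worked out by hand with the same α=0.1 and γ=0.9 used throughout:

Before: V(s) = 0, and the step earns r = 10 with V(s') = 0 (terminal)
TD target: r + γV(s') = 10 + 0.9 × 0 = 10
TD error: 10 - 0 = 10
Update: V(s) ← 0 + 0.1 × 10 = 1.0

That single step is the 0.0 → 1.0 jump for state 3 in Episode 1 below.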
Seeing TD Learning in Action
I implemented a complete TD learning system and watched it learn. Here's what happened:
Episode 1: First Discoveries
Initial values: [0.0, 0.0, 0.0, 0.0, 0.0]
After episode: [-0.09, 0.0, -0.09, 1.0, 0.0]
The agent discovered that state 3 leads to a +10 reward and immediately updated its value!
Episode 20: Information Spreads
Values: [1.57, 4.27, 6.11, 8.88, 0.0]
Like ripples in a pond, the value information propagated backwards. States closer to the reward became more valuable.
Episode 100: Convergence
Final values: [2.42, 4.85, 6.91, 8.67, 0.0]
Perfect! The agent learned that each state's value reflects its distance from the goal.
Why This Matters
TD learning is everywhere in modern AI:
- Game AI: Learning chess positions without playing complete games
- Recommendation Systems: Updating preferences from immediate feedback
- Autonomous Vehicles: Learning road conditions from each sensor reading
- Financial Trading: Adjusting strategies from each market tick
Key Insights from Implementation
1. Bootstrap Learning Works
The agent successfully learned by using its own imperfect estimates. Like a student who gets better by checking their work against their current best understanding.
2. Gradual Convergence
TD errors started large (2.0+) and steadily shrank (to around 1.4), showing the algorithm naturally converging to the correct values.
3. Online Learning is Powerful
No waiting for complete episodes meant faster adaptation and more efficient learning.
The Bigger Picture
TD learning represents a fundamental shift in how we think about learning:
- From batch to online: Learn from each experience immediately
- From certainty to estimation: Use best current guesses to improve
- From complete to incremental: Make progress with partial information
This mirrors how humans actually learn - we don't wait for complete life experiences before updating our beliefs about the world.
Try It Yourself
The complete implementation is available on GitHub with detailed logging so you can watch every step of the learning process. It's fascinating to see an algorithm bootstrap itself to knowledge!
# Watch TD learning in action
env = TDLearningEnvironment(num_states=5)
agent = TDLearningAgent(num_states=5, alpha=0.1, gamma=0.9)
agent.train(env, num_episodes=100)
agent.visualize_training()
What's Next?
This simple TD implementation opens doors to:
- Q-Learning: Learning optimal actions, not just state values (its update rule is sketched after this list)
- Deep TD Networks: Using neural networks for complex state spaces
- Actor-Critic Methods: Combining TD learning with policy optimization
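For comparison, the Q-learning update mentioned above has the same shape as TD(0), just over state-action values instead of state values (not implemented in this project):

Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]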
TD learning isn't just an algorithm - it's a philosophy of learning from incomplete information, which might be the most human thing about artificial intelligence.
Want to dive deeper? Check out the full implementation with step-by-step explanations and visualizations.
⚙️ Requirements File
# requirements.txt
# Core scientific computing
numpy>=1.21.0
matplotlib>=3.5.0
# Data handling and analysis
pandas>=1.3.0
# Enhanced visualization (optional)
seaborn>=0.11.0
plotly>=5.0.0
# Jupyter notebook support (optional)
jupyter>=1.0.0
ipywidgets>=7.6.0
# Development tools (optional)
pytest>=6.0.0
black>=21.0.0
flake8>=3.9.0
# Documentation (optional)
sphinx>=4.0.0
sphinx-rtd-theme>=0.5.0
🚀 Installation Instructions
# Basic installation
pip install -r requirements.txt
# Or minimal installation
pip install numpy matplotlib
# For development
pip install -r requirements.txt
pip install -e .
# For Google Colab
!pip install numpy matplotlib seaborn pandas plotly
🎯 Usage Examples
# examples.py
from td_learning import TDLearningEnvironment, TDLearningAgent
import numpy as np
import matplotlib.pyplot as plt
# Example 1: Basic TD Learning
def basic_example():
    env = TDLearningEnvironment(num_states=5)
    agent = TDLearningAgent(num_states=5, alpha=0.1, gamma=0.9)
    agent.train(env, num_episodes=100)
    agent.visualize_training()
    return agent
# Example 2: Parameter Comparison
def compare_learning_rates():
    results = {}
    learning_rates = [0.01, 0.1, 0.3, 0.5]
    for alpha in learning_rates:
        env = TDLearningEnvironment(num_states=5)
        agent = TDLearningAgent(num_states=5, alpha=alpha, gamma=0.9)
        agent.train(env, num_episodes=100)
        results[alpha] = agent.V.copy()
    # Plot comparison
    plt.figure(figsize=(10, 6))
    for alpha, values in results.items():
        plt.plot(values, label=f'α={alpha}', marker='o')
    plt.xlabel('State')
    plt.ylabel('Final Value')
    plt.title('Effect of Learning Rate on Final Values')
    plt.legend()
    plt.grid(True)
    plt.show()
# Example 3: Environment Size Study
def environment_size_study():
    sizes = [3, 5, 10, 15]
    convergence_episodes = []
    for size in sizes:
        env = TDLearningEnvironment(num_states=size)
        agent = TDLearningAgent(num_states=size, alpha=0.1, gamma=0.9)
        agent.train(env, num_episodes=200)
        # Find convergence point (when TD error < 0.1)
        td_errors = agent.training_metrics['avg_td_error']
        convergence = next((i for i, error in enumerate(td_errors) if error < 0.1), 200)
        convergence_episodes.append(convergence)
    plt.figure(figsize=(8, 6))
    plt.plot(sizes, convergence_episodes, 'bo-')
    plt.xlabel('Environment Size (Number of States)')
    plt.ylabel('Episodes to Convergence')
    plt.title('Convergence Speed vs Environment Complexity')
    plt.grid(True)
    plt.show()
if __name__ == "__main__":
    # Run examples
    agent = basic_example()
    compare_learning_rates()
    environment_size_study()