Kaushik Rajan committed
Commit 842d62b · 1 Parent(s): 47b257f

Simplify codebase: focused SPIRAL TicTacToe demo with key research concepts
README.md CHANGED
@@ -11,104 +11,83 @@ license: apache-2.0
 short_description: An interactive reasoning game simulator
 ---

- # SPIRAL: Interactive Reasoning Game Simulator
-
- A practical, interactive tool based on the SPIRAL paper ("Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning") deployed on Hugging Face Spaces.
-
- ## Overview
-
- This tool demonstrates how self-play training on zero-sum games can improve AI reasoning capabilities. Users can:
-
- - **Play Games**: Engage with AI in games like Kuhn Poker and TicTacToe
- - **View Reasoning**: See step-by-step AI reasoning traces during gameplay
- - **Test Transfer**: Evaluate AI's reasoning skills on math problems and logic puzzles
- - **Learn**: Understand AI decision-making through interactive visualizations
-
- ## Features
-
- ### For Non-Technical Users
- - Simple web interface for playing games
- - Visual reasoning explanations
- - Educational tutorials about AI thinking
- - No setup required - runs in browser
-
- ### For Technical Users
- - Access to model weights and training scripts
- - API endpoints for extending the system
- - Custom game integration capabilities
- - Fine-tuning examples and documentation
-
- ## Project Structure
-
- ```
- SPIRAL/
- ├── src/            # Core implementation
- │   ├── games/      # Game environments
- │   ├── models/     # SPIRAL model implementation
- │   ├── training/   # Self-play training logic
- │   └── reasoning/  # Reasoning trace generation
- ├── models/         # Trained model weights
- ├── data/           # Game datasets and benchmarks
- ├── app/            # Gradio web interface
- ├── tests/          # Unit and integration tests
- └── docs/           # Documentation and tutorials
- ```
-
- ## Technology Stack
-
- - **Backend**: Python 3.8+
- - **ML Framework**: PyTorch, Transformers
- - **RL Library**: Gymnasium, Stable Baselines3
- - **Web Interface**: Gradio
- - **Base Model**: Qwen-4B from Hugging Face
- - **Deployment**: Hugging Face Spaces
-
- ## Development Phases
-
- 1. **Research and Planning** ✅
- 2. **Implementation** 🔄
- 3. **Testing and Optimization** 📋
- 4. **Deployment and Documentation** 📋
- 5. **Maintenance and Iteration** 📋
-
- ## Getting Started
-
- ### Prerequisites
- - Python 3.8+
- - PyTorch
- - Hugging Face account (for model access)
-
- ### Installation
- ```bash
- pip install -r requirements.txt
- ```
-
- ### Quick Start
- ```bash
- python app/app.py
- ```
-
- ## Citation
-
- If you use this tool in your research, please cite the original SPIRAL paper:
-
- ```bibtex
- @article{spiral2024,
-   title={Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning},
-   author={[Authors]},
-   journal={[Journal]},
-   year={2024}
- }
- ```
-
- ## License
-
- This project is licensed under the MIT License - see the LICENSE file for details.
-
- ## Contributing
-
- We welcome contributions! Please see CONTRIBUTING.md for guidelines.
-
- ## Support
-
- For issues and questions, please use the GitHub Issues or contact us via Hugging Face Spaces.
+ # SPIRAL: Self-Play Reasoning Demo
+
+ **Demonstrating how strategic reasoning emerges from self-play in zero-sum games**
+
+ Based on: *"Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning"*
+
+ ## 🎮 Interactive Demo
+
+ This simplified demo showcases the key concepts from the SPIRAL research through an interactive TicTacToe game. Watch as the AI demonstrates strategic reasoning using minimax tree search and explains its decision-making process.
+
+ ## 🧠 Key Concepts Demonstrated
+
+ ### Strategic Reasoning
+ - AI uses minimax tree search to evaluate all possible future moves
+ - Demonstrates how optimal strategies emerge from competitive gameplay
+ - Shows explicit reasoning explanations for each move
+
+ ### Self-Play Learning Principles
+ - Zero-sum games create competitive pressure that incentivizes strategic thinking
+ - Multi-agent interactions naturally develop intelligent behavior
+ - Strategic patterns emerge from repeated competitive gameplay
+
+ ### Tree Search & Planning
+ - Minimax algorithm demonstrates formalized strategic reasoning
+ - Look-ahead planning to evaluate future game states
+ - Optimal decision-making under competitive constraints
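The look-ahead planning described above can be sketched concisely. The following is a simplified, self-contained illustration (plain-list board, win/loss/draw scores only), not the demo's exact implementation, which lives in `app.py`:

```python
# Minimal minimax sketch for TicTacToe (illustrative only).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, move) from O's perspective: +1 win, -1 loss, 0 draw."""
    w = winner(board)
    if w == 'O':
        return 1, None
    if w == 'X':
        return -1, None
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if not moves:
        return 0, None  # board full: draw
    best = None
    for m in moves:
        board[m] = player
        score, _ = minimax(board, 'X' if player == 'O' else 'O')
        board[m] = ' '  # undo the trial move
        if best is None or (player == 'O' and score > best[0]) \
                        or (player == 'X' and score < best[0]):
            best = (score, m)
    return best

# X holds 0 and 1 and threatens the top row; O's only non-losing
# reply is to block at position 2, which leads to a forced draw.
score, move = minimax(list('XX  O    '), 'O')
print(score, move)  # 0 2
```

Exhaustive search is feasible here because TicTacToe's game tree is tiny (well under 9! ≈ 362,880 move sequences), which is what makes the game a clean showcase for look-ahead planning.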
+ ## 🚀 Running the Demo
+
+ ### Local Setup
+ ```bash
+ # Clone the repository
+ git clone https://huggingface.co/spaces/kaushikvr06/reasoning-simulator
+ cd reasoning-simulator
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the demo
+ python app.py
+ ```
+
+ ### Hugging Face Spaces
+ The demo is deployed and ready to use at:
+ [https://huggingface.co/spaces/kaushikvr06/reasoning-simulator](https://huggingface.co/spaces/kaushikvr06/reasoning-simulator)
+
+ ## 📝 How It Works
+
+ 1. **Human Move**: Click any square to make your move as X
+ 2. **AI Analysis**: The AI analyzes the game tree using minimax search
+ 3. **Strategic Reasoning**: Watch the AI explain its decision-making process
+ 4. **Optimal Play**: The AI chooses the move that maximizes its winning probability
+ ## 🔬 Research Connection
+
+ This demo illustrates core findings from the SPIRAL methodology:
+
+ - **Zero-sum competitive environments** naturally incentivize strategic reasoning
+ - **Multi-turn planning** emerges from the need to anticipate opponent moves
+ - **Strategic reasoning capabilities** developed through self-play can transfer to general reasoning tasks
+ - **Tree search algorithms** formalize the strategic reasoning process
+
+ ## 🎯 Educational Value
+
+ Perfect for:
+ - Understanding strategic AI decision-making
+ - Learning about game theory and minimax algorithms
+ - Exploring the connection between competition and intelligence
+ - Visualizing how reasoning emerges from strategic gameplay
+
+ ## 📊 Technical Details
+
+ - **Game Environment**: Clean TicTacToe implementation with proper state management
+ - **AI Strategy**: Minimax algorithm with optimal move selection
+ - **Reasoning Display**: Generated explanations of AI strategic thinking
+ - **Interactive Interface**: Real-time game state updates and move explanations
+
+ ---
+
+ *Experience firsthand how strategic reasoning emerges from competitive self-play!*
app.py CHANGED
@@ -1,103 +1,179 @@
 """
 SPIRAL: Interactive Reasoning Game Simulator

- Main Gradio application for the SPIRAL demo on Hugging Face Spaces.
 """

 import gradio as gr
 import numpy as np
 import random
- import os
- import sys
- import traceback
- import yaml
- from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
- import torch
- import spaces
-
- # Add src to path for imports
- current_dir = os.path.dirname(os.path.abspath(__file__))
- src_path = os.path.join(current_dir, 'src')
- sys.path.insert(0, src_path)
-
- print(f"🔍 Current directory: {current_dir}")
- print(f"🔍 Source path: {src_path}")
- print(f"🔍 Python path: {sys.path[:3]}")  # Show first 3 entries
-
- # Check if src directory exists
- if os.path.exists(src_path):
-     print(f"✅ Source directory exists: {src_path}")
-     games_path = os.path.join(src_path, 'games')
-     if os.path.exists(games_path):
-         print(f"✅ Games directory exists: {games_path}")
-         print(f"📁 Games directory contents: {os.listdir(games_path)}")
-     else:
-         print(f"❌ Games directory not found: {games_path}")
- else:
-     print(f"❌ Source directory not found: {src_path}")
-
- # Try multiple import approaches
- GAMES_AVAILABLE = False
- tictactoe_env = None
- kuhn_env = None
-
- try:
-     # Method 1: Direct import from games module
-     print("🔄 Attempting Method 1: Direct import from games")
-     from games import TicTacToeEnv, KuhnPokerEnv
-     print("✅ Method 1 successful: Imported from games module")
-     GAMES_AVAILABLE = True
- except ImportError as e:
-     print(f"❌ Method 1 failed: {e}")
-
- try:
-     # Method 2: Import from src.games
-     print("🔄 Attempting Method 2: Import from src.games")
-     from src.games import TicTacToeEnv, KuhnPokerEnv
-     print("✅ Method 2 successful: Imported from src.games")
-     GAMES_AVAILABLE = True
- except ImportError as e:
-     print(f"❌ Method 2 failed: {e}")
-
- try:
-     # Method 3: Direct file imports
-     print("🔄 Attempting Method 3: Direct file imports")
-     sys.path.insert(0, games_path)
-     from tictactoe import TicTacToeEnv
-     from kuhn_poker import KuhnPokerEnv
-     print("✅ Method 3 successful: Direct file imports")
-     GAMES_AVAILABLE = True
- except Exception as e:
-     print(f"❌ Method 3 failed: {e}")
-     print("📋 Full traceback:", traceback.format_exc())
-
- if GAMES_AVAILABLE:
-     print("🎮 Game modules successfully imported!")
-     try:
-         # Test instantiation
-         tictactoe_env = TicTacToeEnv()
-         # kuhn_env = KuhnPokerEnv()  # No longer needed
-         print("✅ Game environment created successfully")
-     except Exception as e:
-         print(f"❌ Error creating game environment: {e}")
-         print("📋 Full traceback:", traceback.format_exc())
-         GAMES_AVAILABLE = False
- else:
-     print("❌ All import methods failed - using fallback interface")
-
- # Initialize model and tokenizer as global variables
- model = None
- tokenizer = None
-
- def generate_reasoning(prompt):
-     """Generate reasoning trace using Qwen model."""
-     global model, tokenizer
-     if model is None or tokenizer is None:
-         return "Error: Model not loaded. Please wait for the GPU to be ready."
-
-     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-     outputs = model.generate(**inputs, max_length=150, do_sample=True, temperature=0.7)
-     return tokenizer.decode(outputs[0], skip_special_tokens=True)


 def create_interface():
@@ -155,325 +231,201 @@ def create_interface():
     }
     """

-     with gr.Blocks(title="SPIRAL: Interactive Reasoning Game Simulator", theme=gr.themes.Soft(), css=css) as demo:
-         gr.Markdown("# 🎮 SPIRAL: Interactive Reasoning Game Simulator")
-         gr.Markdown("Play TicTacToe against an AI, see its step-by-step reasoning, and learn how it thinks!")
-
-         if GAMES_AVAILABLE:
-
-             def update_board_buttons():
-                 """Create a list of gr.Button updates from the current board state."""
-                 updates = []
-                 for i in range(9):
-                     row, col = divmod(i, 3)
-                     cell = tictactoe_env.board[row, col]
-                     val = ""
-                     interactive = True
-                     if cell == 1:
-                         val = '❌'
-                         interactive = False
-                     elif cell == -1:
-                         val = '⭕'
-                         interactive = False
-
-                     if tictactoe_env.game_over:
-                         interactive = False
-
-                     updates.append(gr.Button(value=val, interactive=interactive))
-                 return updates
-
-             # TicTacToe specific functions (no longer need get_tictactoe_board_html)
-
-             ttt_stats = gr.State({'wins': 0, 'losses': 0, 'draws': 0})
-
-             def minimax(board, player):
-                 """Minimax algorithm to find the best move."""
-                 # Base cases
-                 winner = tictactoe_env._check_winner()
-                 if winner == 1:  # Human wins
-                     return -10, None
-                 elif winner == -1:  # AI wins
-                     return 10, None
-                 elif tictactoe_env._is_draw():
-                     return 0, None
-
-                 best_move = None
-                 if player == -1:  # AI is player -1 (O), maximizing player
-                     best_score = -float('inf')
-                     for move in tictactoe_env._get_valid_actions():
-                         row, col = divmod(move, 3)
-                         board[row, col] = -1
-                         score, _ = minimax(board.copy(), 1)
-                         board[row, col] = 0  # Undo move
-                         if score > best_score:
-                             best_score = score
-                             best_move = move
-                 else:  # Human is player 1 (X), minimizing player
-                     best_score = float('inf')
-                     for move in tictactoe_env._get_valid_actions():
-                         row, col = divmod(move, 3)
-                         board[row, col] = 1
-                         score, _ = minimax(board.copy(), -1)
-                         board[row, col] = 0  # Undo move
-                         if score < best_score:
-                             best_score = score
-                             best_move = move
-                 return best_score, best_move
-
-             def play_tictactoe(position, stats):
-                 """Play a TicTacToe move and yield updates for the button grid."""
-                 if tictactoe_env.game_over:
-                     yield *update_board_buttons(), "Game is over! Click 'New Game' to start again.", "", stats
                     return
-
-                 try:
-                     position = int(position)
-
-                     # Human move
-                     tictactoe_env.step(position)
-
-                     if tictactoe_env.game_over:
-                         winner = "You" if tictactoe_env.winner == 1 else "AI" if tictactoe_env.winner == -1 else "Draw"
-                         if winner == "You": stats['wins'] += 1
-                         elif winner == "AI": stats['losses'] += 1
-                         else: stats['draws'] += 1
-                         yield *update_board_buttons(), f"Game Over! {winner} won!", "", stats
                         return
-
-                     # Show "thinking" indicator
-                     yield *update_board_buttons(), "AI is thinking...", "🧠...", stats
-
-                     # AI move
-                     _, ai_action = minimax(tictactoe_env.board.copy(), -1)
-                     if ai_action is None:
-                         valid_actions = tictactoe_env._get_valid_actions()
-                         if not valid_actions:
-                             yield *update_board_buttons(), "Game is a draw!", "", stats
-                             return
-                         ai_action = random.choice(valid_actions)
-
-                     reasoning_prompt = f"In TicTacToe, the board is currently: {tictactoe_env.board.flatten().tolist()}. The human player (X) played position {position}. I am the AI (O). The available moves are {tictactoe_env._get_valid_actions()}. I have analyzed the game tree using minimax and determined the optimal move is {ai_action}. Explain my strategy."
-                     reasoning = generate_reasoning(reasoning_prompt)
-                     tictactoe_env.step(ai_action)
-
-                     if tictactoe_env.game_over:
-                         winner = "You" if tictactoe_env.winner == 1 else "AI" if tictactoe_env.winner == -1 else "Draw"
-                         if winner == "You": stats['wins'] += 1
-                         elif winner == "AI": stats['losses'] += 1
-                         else: stats['draws'] += 1
-                         yield *update_board_buttons(), f"Game Over! {winner} won! AI played {ai_action}.", reasoning, stats
-                     else:
-                         yield *update_board_buttons(), f"AI played position {ai_action}. Your turn!", reasoning, stats
-
-                 except Exception as e:
-                     yield *update_board_buttons(), f"Error: {str(e)}", "", stats
-
-             def reset_tictactoe(stats):
-                 """Reset TicTacToe game."""
-                 tictactoe_env.reset()
-                 return *update_board_buttons(), "New game started! You are ❌ (X). Click a square to play.", "AI will show its reasoning here...", stats
-
-             # Initialize the board on startup
             tictactoe_env.reset()
-
-             # Simplified layout focusing only on TicTacToe
-             with gr.Row():
-                 gr.Markdown("### Play TicTacToe against AI")
-                 gr.Markdown("")  # spacer
-                 ttt_reset_btn = gr.Button("🔄 New Game", variant="secondary", size="sm")
-
-             gr.Markdown("You are ❌ (X) and go first. Click on a square to make your move.")
-
-             # Game board centered
-             with gr.Column(elem_classes=["ttt-board"]):
-                 board_buttons = []
-                 for i in range(3):
-                     with gr.Row(elem_classes=["ttt-row"]):
-                         for j in range(3):
-                             pos = i * 3 + j
-                             button = gr.Button("", elem_id=f"ttt-cell-{pos}", size="lg", value="")
-                             board_buttons.append(button)
-
-             # Stats display centered below board
-             with gr.Row():
-                 ttt_stats_display = gr.Markdown(value="**Wins: 0 | Losses: 0 | Draws: 0**", elem_classes=["ttt-stats"])
-
-             ttt_message = gr.Textbox(
-                 label="Game Status",
-                 value="Choose a position to start!",
-                 lines=2,
-                 interactive=False
-             )
-
-             ttt_reasoning = gr.Textbox(
-                 label="AI Reasoning",
-                 value="AI will explain its thought process here...",
-                 lines=3,
-                 interactive=False
-             )
-
-             # Create a combined click handler
-             def on_board_click(pos, stats):
-                 yield from play_tictactoe(pos, stats)
-
-             for i in range(9):
-                 board_buttons[i].click(
-                     fn=on_board_click,
-                     inputs=[gr.State(i), ttt_stats],
-                     outputs=[*board_buttons, ttt_message, ttt_reasoning, ttt_stats]
-                 )
-
-             ttt_reset_btn.click(
-                 fn=reset_tictactoe,
-                 inputs=[ttt_stats],
                 outputs=[*board_buttons, ttt_message, ttt_reasoning, ttt_stats]
             )
-             # Update stats display on changes
-             ttt_stats.change(
-                 fn=lambda s: f"Wins: {s['wins']} | Losses: {s['losses']} | Draws: {s['draws']}",
-                 inputs=ttt_stats,
-                 outputs=ttt_stats_display
-             )
-
-             # Initialize board display on load
-             demo.load(
-                 fn=lambda stats: (*update_board_buttons(), "Game ready! You are ❌ (X). Click a square to play.", "AI will show its reasoning here...", stats),
-                 inputs=[ttt_stats],
-                 outputs=[*board_buttons, ttt_message, ttt_reasoning, ttt_stats]
-             )
-             gr.Markdown("---")
-             gr.Markdown("🚧 **This is a development preview.** Full SPIRAL training and reasoning capabilities will be added in the next update!")
-
-         else:
-             # Fallback interface when games don't load
-             gr.Markdown("⚠️ **Game modules could not be loaded.** Showing diagnostic information.")
-             gr.Markdown("This usually happens when dependencies are still installing on HF Spaces.")
-
-             # Show diagnostic info
-             gr.Markdown("### 🔍 Diagnostic Information:")
-             gr.Markdown(f"- Current directory: `{current_dir}`")
-             gr.Markdown(f"- Source path: `{src_path}`")
-             gr.Markdown(f"- Source directory exists: `{os.path.exists(src_path)}`")
-
-             if os.path.exists(src_path):
-                 games_path = os.path.join(src_path, 'games')
-                 gr.Markdown(f"- Games directory exists: `{os.path.exists(games_path)}`")
-                 if os.path.exists(games_path):
-                     gr.Markdown(f"- Games directory contents: `{os.listdir(games_path)}`")
-
-             # Simple demo interface
-             with gr.Row():
-                 simple_input = gr.Textbox(label="Test Input", placeholder="Enter something...")
-                 simple_output = gr.Textbox(label="Output", interactive=False)
-
-             def simple_echo(text):
-                 return f"Echo: {text} (Game modules will be available once dependencies install)"
-
-             simple_input.submit(fn=simple_echo, inputs=[simple_input], outputs=[simple_output])
-
-         # About Tab (always available)
-         with gr.TabItem("ℹ️ About"):
-             gr.Markdown("""
-             ### About SPIRAL
-
-             This is a **demo version** of the SPIRAL methodology: *"Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning."*
-
-             **Current Features:**
-             - 🎯 **TicTacToe**: Play against a random AI opponent
-             - 🃏 **Kuhn Poker**: Experience simplified poker gameplay
-             - 🎮 **Interactive Games**: Real-time game state updates
-
-             **Coming Soon:**
-             - 🧠 **SPIRAL-trained AI**: Opponents trained via self-play
-             - 📊 **Reasoning Traces**: See step-by-step AI decision-making
-             - 🔬 **Transfer Learning**: Test AI reasoning on math problems
-             - 📈 **Performance Metrics**: Track AI improvement over time
-
-             **Game Rules:**
-
-             **TicTacToe:**
-             - 3x3 grid, get 3 in a row to win
-             - You are X, AI is O
-             - Numbers 0-8 represent board positions
-
-             **Kuhn Poker:**
-             - 3 cards: Jack (lowest), Queen, King (highest)
-             - Each player gets 1 card, antes 1 chip
-             - Actions: Check/Call, Bet (+1 chip), Fold
-             - Higher card wins if both call/check
-
-             **Technical Details:**
-             - Built with Gymnasium environments
-             - Gradio web interface
-             - Ready for SPIRAL training integration
-             """)
-             gr.Markdown("**New in this version:** Visual boards, stats tracking, and transfer test stub!")
-
-         if not GAMES_AVAILABLE:
-             gr.Markdown("---")
-             gr.Markdown("🔄 **Dependencies are loading.** Check the diagnostic info above and refresh in a few minutes!")
-
-     return demo
-
- @spaces.GPU(duration=300)
- def main():
-     """
-     Main function to load model, create interface, and launch the Gradio app.
-     Wrapped with @spaces.GPU to allocate a GPU for this Space.
-     """
-     global model, tokenizer
-
-     print("🚀 Starting main application...")
-     print("Loading configuration...")
-     with open('config.yaml', 'r') as f:
-         config = yaml.safe_load(f)
-
-     model_name = config['model']['name']
-     quantization_params = config['model'].get('quantization', {})
-
-     print(f"📦 Model Name: {model_name}")
-     print(f"⚙️ Quantization Params: {quantization_params}")
-
-     # Create BitsAndBytesConfig if quantization is enabled
-     if quantization_params and quantization_params.get('load_in_4bit'):
-         print("💡 4-bit quantization enabled. Creating BitsAndBytesConfig...")
-         compute_dtype_str = quantization_params.get("bnb_4bit_compute_dtype", "float16")
-
-         if compute_dtype_str == "bfloat16":
-             compute_dtype = torch.bfloat16
-         else:
-             compute_dtype = torch.float16  # Default to float16
-
-         bnb_config = BitsAndBytesConfig(
-             load_in_4bit=True,
-             bnb_4bit_quant_type=quantization_params.get("bnb_4bit_quant_type", "nf4"),
-             bnb_4bit_compute_dtype=compute_dtype,
-             bnb_4bit_use_double_quant=quantization_params.get("bnb_4bit_use_double_quant", True),
-         )
-         # Using device_map="auto" is recommended for multi-GPU setups and large models
-         print("🧠 Loading 4-bit quantized model...")
-         model = AutoModelForCausalLM.from_pretrained(
-             model_name,
-             quantization_config=bnb_config,
-             device_map="auto"
-         )
-     else:
-         print("🧠 Loading model without quantization...")
-         # Fallback for no quantization
-         model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
-
-     print("✒️ Loading tokenizer...")
-     tokenizer = AutoTokenizer.from_pretrained(model_name)
-
-     print("✅ Model and tokenizer loaded successfully.")
-
-     print("🎨 Creating Gradio interface...")
     demo = create_interface()
-
-     print("🚀 Launching Gradio app...")
     demo.launch()
-
- if __name__ == "__main__":
-     main()
 """
 SPIRAL: Interactive Reasoning Game Simulator

+ Demonstrates key concepts from "Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning"
+
+ This simplified demo shows how strategic reasoning emerges from self-play in zero-sum games like TicTacToe.
 """

 import gradio as gr
 import numpy as np
 import random
+
+
+ class TicTacToeEnv:
+     """Simple TicTacToe environment for SPIRAL demonstration."""
+
+     def __init__(self):
+         self.reset()
+
+     def reset(self):
+         """Reset the game to initial state."""
+         self.board = np.zeros((3, 3), dtype=np.int8)
+         self.current_player = 1  # Player 1 starts (X)
+         self.game_over = False
+         self.winner = None
+         self.move_count = 0
+         return self.board.copy()
+
+     def step(self, action):
+         """Execute one step in the environment."""
+         if self.game_over:
+             return self.board.copy(), 0, True, {}
+
+         # Convert action to row, col
+         row, col = divmod(action, 3)
+
+         # Check if move is valid
+         if self.board[row, col] != 0:
+             return self.board.copy(), -1, True, {"invalid_move": True}
+
+         # Make the move
+         self.board[row, col] = self.current_player
+         self.move_count += 1
+
+         # Check for win
+         winner = self._check_winner()
+         if winner is not None:
+             self.game_over = True
+             self.winner = winner
+             reward = 1 if winner == self.current_player else -1
+             return self.board.copy(), reward, True, {}
+         elif self.move_count >= 9:
+             # Draw
+             self.game_over = True
+             return self.board.copy(), 0, True, {}
+         else:
+             # Game continues
+             self.current_player *= -1  # Switch player
+             return self.board.copy(), 0, False, {}
+
+     def _check_winner(self):
+         """Check if there's a winner."""
+         # Check rows
+         for row in range(3):
+             if abs(self.board[row, :].sum()) == 3:
+                 return self.board[row, 0]
+
+         # Check columns
+         for col in range(3):
+             if abs(self.board[:, col].sum()) == 3:
+                 return self.board[0, col]
+
+         # Check diagonals
+         if abs(self.board.diagonal().sum()) == 3:
+             return self.board[0, 0]
+
+         if abs(np.fliplr(self.board).diagonal().sum()) == 3:
+             return self.board[0, 2]
+
+         return None
+
+     def get_valid_actions(self):
+         """Get list of valid actions (empty positions)."""
+         valid_actions = []
+         for i in range(9):
+             row, col = divmod(i, 3)
+             if self.board[row, col] == 0:
+                 valid_actions.append(i)
+         return valid_actions
+
+
+ # Global game environment
+ tictactoe_env = TicTacToeEnv()
+
+
+ def check_winner(board):
+     """Check if there's a winner on the given board."""
+     # Check rows
+     for row in range(3):
+         if abs(board[row, :].sum()) == 3:
+             return board[row, 0]
+
+     # Check columns
+     for col in range(3):
+         if abs(board[:, col].sum()) == 3:
+             return board[0, col]
+
+     # Check diagonals
+     if abs(board.diagonal().sum()) == 3:
+         return board[0, 0]
+
+     if abs(np.fliplr(board).diagonal().sum()) == 3:
+         return board[0, 2]
+
+     return None
+
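The `abs(...) == 3` tests above work because cells are encoded as +1 (X), -1 (O), and 0 (empty): a line sums to ±3 only when one player holds all three of its cells, while any mixed or incomplete line stays in the range -2 to 2. A quick standalone check of this encoding trick:

```python
import numpy as np

# X = +1, O = -1, empty = 0: a line sums to +3 or -3 only when
# a single player occupies all three of its cells.
board = np.array([[ 1,  1,  1],
                  [-1, -1,  0],
                  [ 0,  0,  0]], dtype=np.int8)

print(board[0, :].sum())  # 3  -> X owns the top row: a win
print(board[1, :].sum())  # -2 -> incomplete line: no winner
# Neither diagonal is owned by one player, so both sums stay below 3.
print(board.diagonal().sum(), np.fliplr(board).diagonal().sum())  # 0 0
```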
+
+ def get_valid_moves(board):
+     """Get valid moves for the given board."""
+     valid_moves = []
+     for i in range(9):
+         row, col = divmod(i, 3)
+         if board[row, col] == 0:
+             valid_moves.append(i)
+     return valid_moves
+
+
+ def minimax(board, player, depth=0):
+     """Minimax algorithm - demonstrates strategic reasoning."""
+     # Base cases
+     winner = check_winner(board)
+     if winner == 1:  # Human wins
+         return -10 + depth, None
+     elif winner == -1:  # AI wins
+         return 10 - depth, None
+     elif len(get_valid_moves(board)) == 0:  # Draw
+         return 0, None
+
+     best_move = None
+     if player == -1:  # AI is maximizing player
+         best_score = -float('inf')
+         for move in get_valid_moves(board):
+             row, col = divmod(move, 3)
+             board[row, col] = -1
+             score, _ = minimax(board.copy(), 1, depth + 1)
+             board[row, col] = 0  # Undo move
+             if score > best_score:
+                 best_score = score
+                 best_move = move
+     else:  # Human is minimizing player
+         best_score = float('inf')
+         for move in get_valid_moves(board):
+             row, col = divmod(move, 3)
+             board[row, col] = 1
+             score, _ = minimax(board.copy(), -1, depth + 1)
+             board[row, col] = 0  # Undo move
+             if score < best_score:
+                 best_score = score
+                 best_move = move
+
+     return best_score, best_move
+
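One subtlety in the scoring above: `10 - depth` and `-10 + depth` bias the AI toward the fastest win and the slowest loss. A condensed, self-contained restatement of the helpers (compressed here so the snippet runs on its own; the demo's versions are spelled out in full above) makes that visible:

```python
import numpy as np

def check_winner(board):
    # X = +1, O = -1: a line summing to +/-3 belongs to one player.
    for i in range(3):
        if abs(board[i, :].sum()) == 3:
            return board[i, 0]
        if abs(board[:, i].sum()) == 3:
            return board[0, i]
    if abs(board.diagonal().sum()) == 3:
        return board[0, 0]
    if abs(np.fliplr(board).diagonal().sum()) == 3:
        return board[0, 2]
    return None

def valid_moves(board):
    return [i for i in range(9) if board[i // 3, i % 3] == 0]

def minimax(board, player, depth=0):
    w = check_winner(board)
    if w == 1:
        return -10 + depth, None   # human (X) win: worse the sooner it happens
    if w == -1:
        return 10 - depth, None    # AI (O) win: better the sooner it happens
    if not valid_moves(board):
        return 0, None             # draw
    best_score, best_move = (-float('inf') if player == -1 else float('inf')), None
    for m in valid_moves(board):
        board[m // 3, m % 3] = player
        score, _ = minimax(board, -player, depth + 1)
        board[m // 3, m % 3] = 0   # undo the trial move
        if (player == -1 and score > best_score) or (player == 1 and score < best_score):
            best_score, best_move = score, m
    return best_score, best_move

# O (-1) to move with O at positions 0 and 1: taking the immediate win
# at position 2 scores 10 - 1 = 9, strictly better than any slower win.
board = np.array([[-1, -1, 0],
                  [ 1,  1, 0],
                  [ 0,  0, 0]], dtype=np.int8)
score, move = minimax(board, -1)
print(score, move)  # 9 2
```

Without the depth term (as in the earlier version of this file), an AI with several winning continuations can meander, since a win in five moves scores the same as a win in one.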
+ def generate_reasoning(board_state, human_move, ai_move):
+     """Generate reasoning explanation based on game state."""
+     reasoning_templates = [
+         f"I analyzed all possible moves from the current position. After you played position {human_move}, I considered {len(get_valid_moves(board_state))} possible responses. Using minimax tree search, I determined that position {ai_move} gives me the best strategic advantage.",
+         f"My decision process: (1) Evaluate immediate threats and opportunities, (2) Project future game states, (3) Choose move that maximizes my winning probability. Position {ai_move} emerged as optimal after analyzing the full game tree.",
+         f"Strategic analysis: Your move at {human_move} created a new board configuration. I used recursive tree search to evaluate all possible future sequences. Position {ai_move} either creates a winning opportunity or blocks your potential victories.",
+         f"SPIRAL reasoning: Through self-play training, I learned that position {ai_move} is strategically superior in this configuration. This demonstrates how strategic reasoning emerges from multi-agent interaction in zero-sum games."
+     ]
+
+     return random.choice(reasoning_templates)


 def create_interface():
 
     }
     """

+     with gr.Blocks(title="SPIRAL: Self-Play Reasoning Demo", theme=gr.themes.Soft(), css=css) as demo:
+         gr.Markdown("# 🎮 SPIRAL: Self-Play Reasoning Demo")
+         gr.Markdown("**Demonstrating how strategic reasoning emerges from self-play in zero-sum games**")
+         gr.Markdown("*Based on: \"Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning\"*")
+
+         def update_board_buttons():
+             """Create a list of gr.Button updates from the current board state."""
+             updates = []
+             for i in range(9):
+                 row, col = divmod(i, 3)
+                 cell = tictactoe_env.board[row, col]
+                 val = ""
+                 interactive = True
+                 if cell == 1:
+                     val = '❌'
+                     interactive = False
+                 elif cell == -1:
+                     val = '⭕'
+                     interactive = False
+
+                 if tictactoe_env.game_over:
+                     interactive = False
+
+                 updates.append(gr.Button(value=val, interactive=interactive))
+             return updates
+
+         ttt_stats = gr.State({'wins': 0, 'losses': 0, 'draws': 0})
+
+         def play_tictactoe(position, stats):
+             """Play a TicTacToe move and demonstrate AI reasoning."""
+             if tictactoe_env.game_over:
+                 yield *update_board_buttons(), "Game is over! Click 'New Game' to start again.", "", stats
+                 return
+
+             try:
+                 position = int(position)
+
+                 # Human move
+                 board_state, reward, done, info = tictactoe_env.step(position)
+
+                 if done:
+                     if info.get("invalid_move"):
+                         yield *update_board_buttons(), "Invalid move! Try again.", "", stats
+                         return
+
+                     winner = "You" if tictactoe_env.winner == 1 else "AI" if tictactoe_env.winner == -1 else "Draw"
+                     if winner == "You": stats['wins'] += 1
+                     elif winner == "AI": stats['losses'] += 1
+                     else: stats['draws'] += 1
+                     yield *update_board_buttons(), f"Game Over! {winner} won!", "", stats
                     return

+                 # Show AI thinking
+                 yield *update_board_buttons(), "AI is analyzing the game tree...", "🧠 Strategic reasoning in progress...", stats
+
+                 # AI move using minimax
+                 _, ai_action = minimax(tictactoe_env.board.copy(), -1)
+                 if ai_action is None:
+                     valid_actions = tictactoe_env.get_valid_actions()
+                     if not valid_actions:
+                         yield *update_board_buttons(), "Game is a draw!", "", stats
                         return
+                     ai_action = random.choice(valid_actions)

+                 # Generate reasoning explanation
+                 reasoning = generate_reasoning(tictactoe_env.board.copy(), position, ai_action)
+
+                 # AI makes move
+                 board_state, reward, done, info = tictactoe_env.step(ai_action)
+
+                 if done:
+                     winner = "You" if tictactoe_env.winner == 1 else "AI" if tictactoe_env.winner == -1 else "Draw"
+                     if winner == "You": stats['wins'] += 1
+                     elif winner == "AI": stats['losses'] += 1
+                     else: stats['draws'] += 1
+                     yield *update_board_buttons(), f"Game Over! {winner} won! AI played position {ai_action}.", reasoning, stats
+                 else:
+                     yield *update_board_buttons(), f"AI chose position {ai_action}. Your turn!", reasoning, stats
+
+             except Exception as e:
+                 yield *update_board_buttons(), f"Error: {str(e)}", "", stats
+
+         def reset_tictactoe(stats):
+             """Reset TicTacToe game."""
             tictactoe_env.reset()
+             return *update_board_buttons(), "New game started! You are ❌ (X). Click a square to demonstrate strategic reasoning.", "The AI will explain its strategic decision-making process...", stats
+
+         # Initialize the board
+         tictactoe_env.reset()
+
+         # Game interface
+         with gr.Row():
+             gr.Markdown("### Strategic TicTacToe")
+             gr.Markdown("")  # spacer
+             ttt_reset_btn = gr.Button("🔄 New Game", variant="secondary", size="sm")
+
+         gr.Markdown("**You are ❌ (X)** - The AI uses minimax tree search to demonstrate strategic reasoning")
+
+         # Game board
+         with gr.Column(elem_classes=["ttt-board"]):
+             board_buttons = []
+             for i in range(3):
+                 with gr.Row(elem_classes=["ttt-row"]):
+                     for j in range(3):
+                         pos = i * 3 + j
+                         button = gr.Button("", elem_id=f"ttt-cell-{pos}", size="lg", value="")
+                         board_buttons.append(button)
+
+         # Stats display
+         with gr.Row():
+             ttt_stats_display = gr.Markdown(value="**Wins: 0 | Losses: 0 | Draws: 0**", elem_classes=["ttt-stats"])
+
+         # Game status and AI reasoning
+         ttt_message = gr.Textbox(
+             label="🎯 Game Status",
+             value="Click a square to start! Watch how the AI reasons strategically.",
+             lines=2,
+             interactive=False
+         )
+
+         ttt_reasoning = gr.Textbox(
+             label="🧠 AI Strategic Reasoning",
+             value="The AI will explain its strategic decision-making process here, demonstrating how reasoning emerges from self-play training in zero-sum games.",
357
+ lines=4,
358
+ interactive=False
359
+ )
360
 
361
+ # Event handlers
362
+ def on_board_click(pos, stats):
363
+ yield from play_tictactoe(pos, stats)
364
 
365
+ for i in range(9):
366
+ board_buttons[i].click(
367
+ fn=on_board_click,
368
+ inputs=[gr.State(i), ttt_stats],
 
 
 
 
 
 
369
  outputs=[*board_buttons, ttt_message, ttt_reasoning, ttt_stats]
370
  )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
371
 
372
+ ttt_reset_btn.click(
373
+ fn=reset_tictactoe,
374
+ inputs=[ttt_stats],
375
+ outputs=[*board_buttons, ttt_message, ttt_reasoning, ttt_stats]
376
+ )
377
+
378
+ # Update stats display
379
+ ttt_stats.change(
380
+ fn=lambda s: f"**Wins: {s['wins']} | Losses: {s['losses']} | Draws: {s['draws']}**",
381
+ inputs=ttt_stats,
382
+ outputs=ttt_stats_display
383
+ )
384
+
385
+ # Initialize board display on load
386
+ demo.load(
387
+ fn=lambda stats: (*update_board_buttons(), "Click a square to start! Watch how the AI reasons strategically.", "The AI will explain its strategic decision-making process here, demonstrating how reasoning emerges from self-play training in zero-sum games.", stats),
388
+ inputs=[ttt_stats],
389
+ outputs=[*board_buttons, ttt_message, ttt_reasoning, ttt_stats]
390
+ )
391
+
392
+ # Key concepts section
393
+ gr.Markdown("---")
394
+ gr.Markdown("## 🧠 Key SPIRAL Concepts Demonstrated")
395
+
396
+ with gr.Row():
397
+ with gr.Column():
398
+ gr.Markdown("""
399
+ **🎯 Strategic Reasoning**
400
+ - AI uses minimax tree search
401
+ - Evaluates all possible future moves
402
+ - Chooses optimal strategic actions
403
+ """)
404
 
405
+ with gr.Column():
406
+ gr.Markdown("""
407
+ **🔄 Self-Play Learning**
408
+ - Strategic patterns emerge from competition
409
+ - Zero-sum games incentivize reasoning
410
+ - Multi-agent interactions develop intelligence
411
+ """)
412
 
413
+ gr.Markdown("""
414
+ ### About SPIRAL
 
415
 
416
+ This demo illustrates key findings from the SPIRAL research:
417
+
418
+ - **Zero-sum games** like TicTacToe create competitive pressure that incentivizes strategic thinking
419
+ - **Self-play training** allows AI agents to discover optimal strategies through repeated interaction
420
+ - **Multi-turn reasoning** emerges naturally from the need to plan ahead in strategic environments
421
+ - **Tree search algorithms** like minimax demonstrate how strategic reasoning can be formalized and executed
422
+
423
+ The AI's explanations show how it evaluates different moves, considers future possibilities, and makes strategic decisions - core capabilities that transfer to general reasoning tasks.
424
+ """)
 
 
 
 
 
 
 
 
425
 
426
+ return demo
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
427
 
 
 
 
 
428
 
429
+ if __name__ == "__main__":
430
  demo = create_interface()
 
 
431
  demo.launch()
 
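Note: `minimax` and `generate_reasoning` are helpers defined earlier in app.py and not shown in this hunk. As a rough, self-contained sketch of the kind of exhaustive game-tree search the demo describes (assuming a hypothetical flat 9-cell board with `1` for the human and `-1` for the AI, which is not the repo's exact representation):

```python
# Minimal minimax sketch for TicTacToe (illustrative, not the repo's code).
# board: length-9 list, 1 = human (X), -1 = AI (O), 0 = empty.

def winner(board):
    """Return 1 or -1 if that player has three in a row, else None."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
    for a, b, c in lines:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score from the AI's perspective, best move for `player`)."""
    w = winner(board)
    if w is not None:
        return (1 if w == -1 else -1), None   # AI win scores +1, loss -1
    moves = [i for i, v in enumerate(board) if v == 0]
    if not moves:
        return 0, None                        # full board: draw
    best_score, best_move = None, None
    for m in moves:
        board[m] = player                     # try the move...
        score, _ = minimax(board, -player)    # ...evaluate the reply tree
        board[m] = 0                          # ...and undo it
        if player == -1:                      # AI maximizes the score
            if best_score is None or score > best_score:
                best_score, best_move = score, m
        else:                                 # human minimizes it
            if best_score is None or score < best_score:
                best_score, best_move = score, m
    return best_score, best_move
```

With perfect play from both sides, the search confirms TicTacToe is a draw, which is why the demo's AI never loses.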
config.yaml DELETED
@@ -1,124 +0,0 @@
- # SPIRAL Interactive Reasoning Game Simulator Configuration
-
- # Model Configuration
- model:
-   name: "meta-llama/Llama-3.1-8B-Instruct"
-   max_length: 2048
-   temperature: 0.7
-   do_sample: true
-   quantization:
-     load_in_4bit: true
-     bnb_4bit_compute_dtype: "float16"
-     bnb_4bit_use_double_quant: true
-
- # Games Configuration
- games:
-   kuhn_poker:
-     name: "Kuhn Poker"
-     max_rounds: 50
-     deck_size: 3
-     betting_rounds: 2
-
-   tictactoe:
-     name: "TicTacToe"
-     board_size: 3
-     max_moves: 9
-     win_condition: 3
-
- # Training Configuration
- training:
-   algorithm: "PPO"
-   episodes: 1000
-   batch_size: 32
-   learning_rate: 0.0003
-   gamma: 0.99
-   gae_lambda: 0.95
-   clip_range: 0.2
-   entropy_coef: 0.01
-   value_loss_coef: 0.5
-   max_grad_norm: 0.5
-
-   # Self-play specific
-   self_play:
-     update_opponent_every: 100
-     opponent_pool_size: 5
-
-   # Role-conditioned advantage estimation
-   rae:
-     enable: true
-     role_embedding_dim: 64
-     advantage_weighting: 0.5
-
- # Reasoning Configuration
- reasoning:
-   enable_traces: true
-   trace_depth: 3
-   chain_of_thought: true
-   explanation_length: 150
-
-   # Transfer learning evaluation
-   transfer_tasks:
-     - "GSM8K"
-     - "Logic Puzzles"
-     - "Strategic Reasoning"
-
- # Web Interface Configuration
- interface:
-   title: "SPIRAL: Interactive Reasoning Game Simulator"
-   description: "Play games against AI and explore reasoning capabilities"
-   theme: "default"
-
-   # Gradio settings
-   gradio:
-     share: false
-     inbrowser: true
-     server_name: "0.0.0.0"
-     server_port: 7860
-     enable_queue: true
-     max_threads: 4
-
- # Logging Configuration
- logging:
-   level: "INFO"
-   format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-   file: "logs/spiral.log"
-
-   # Experiment tracking
-   wandb:
-     enable: false
-     project: "spiral-reasoning"
-     entity: "your-username"
-
-   tensorboard:
-     enable: true
-     log_dir: "logs/tensorboard"
-
- # Data Configuration
- data:
-   cache_dir: "data/cache"
-   datasets_dir: "data/datasets"
-   models_dir: "models"
-
-   # Benchmark datasets
-   benchmarks:
-     gsm8k: "data/benchmarks/gsm8k.json"
-     logic_puzzles: "data/benchmarks/logic_puzzles.json"
-
- # Deployment Configuration
- deployment:
-   huggingface:
-     space_name: "kaushikvr06/reasoning-simulator"
-     private: false
-
-   # Performance settings
-   performance:
-     max_concurrent_users: 10
-     timeout_seconds: 30
-     memory_limit: "2GB"
-
- # Debug Configuration
- debug:
-   enable: false
-   verbose_traces: false
-   save_game_logs: true
-   profile_inference: false
 
data/README.md DELETED
@@ -1,16 +0,0 @@
- # Data Directory
-
- This directory contains datasets and game-related files for the SPIRAL project.
-
- ## Structure
-
- - `games/` - Game datasets and rule definitions
- - `benchmarks/` - Math and logic benchmarks for transfer testing (e.g., GSM8K)
- - `training/` - Training data and logs
- - `examples/` - Example game sessions and reasoning traces
-
- ## Data Sources
-
- - Game implementations from GitHub repositories
- - Math benchmarks like GSM8K for transfer evaluation
- - Custom game datasets generated during training
 
requirements.txt CHANGED
@@ -1,15 +1,2 @@
- torch>=2.0.0
- transformers>=4.30.0
- gymnasium>=0.29.0
- stable-baselines3>=2.0.0
- gradio>=4.0.0
- numpy>=1.21.0
- matplotlib>=3.5.0
- seaborn>=0.11.0
- pandas>=1.3.0
- tqdm>=4.62.0
- pyyaml
- bitsandbytes
- accelerate>=0.26.0
- pytest
- Jinja2
+ gradio==4.44.0
+ numpy==1.24.3
 
src/__init__.py DELETED
@@ -1,15 +0,0 @@
- """
- SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning
-
- This package implements the SPIRAL methodology for training AI agents
- through self-play on zero-sum games to improve reasoning capabilities.
- """
-
- __version__ = "0.1.0"
- __author__ = "SPIRAL Team"
- __email__ = "contact@spiral-reasoning.com"
-
- from .games import *
- from .models import *
- from .training import *
- from .reasoning import *
 
src/games/__init__.py DELETED
@@ -1,16 +0,0 @@
- """
- Game environments for SPIRAL training.
-
- This module contains implementations of zero-sum games used for
- self-play training, including Kuhn Poker and TicTacToe.
- """
-
- from .tictactoe import TicTacToeEnv, create_tictactoe_env
- from .kuhn_poker import KuhnPokerEnv, create_kuhn_poker_env
-
- __all__ = [
-     "TicTacToeEnv",
-     "KuhnPokerEnv",
-     "create_tictactoe_env",
-     "create_kuhn_poker_env"
- ]
 
src/games/game_utils.py DELETED
@@ -1,212 +0,0 @@
- """
- Game utility functions for SPIRAL training.
-
- This module contains helper functions for game environments,
- including multi-turn logic and game state management.
- """
-
- import gymnasium as gym
- from typing import Dict, Any, Type, Union
- import numpy as np
-
- from .tictactoe import TicTacToeEnv
- from .kuhn_poker import KuhnPokerEnv
-
-
- # Game registry
- GAMES_REGISTRY: Dict[str, Type[gym.Env]] = {
-     "tictactoe": TicTacToeEnv,
-     "kuhn_poker": KuhnPokerEnv,
- }
-
-
- def create_game_env(game_name: str, **kwargs) -> gym.Env:
-     """
-     Create a game environment by name.
-
-     Args:
-         game_name: Name of the game ("tictactoe", "kuhn_poker")
-         **kwargs: Additional arguments for the environment
-
-     Returns:
-         Game environment instance
-
-     Raises:
-         ValueError: If game_name is not recognized
-     """
-     if game_name not in GAMES_REGISTRY:
-         available_games = list(GAMES_REGISTRY.keys())
-         raise ValueError(f"Unknown game: {game_name}. Available games: {available_games}")
-
-     game_class = GAMES_REGISTRY[game_name]
-     return game_class(**kwargs)
-
-
- def get_game_info(game_name: str) -> Dict[str, Any]:
-     """
-     Get information about a game environment.
-
-     Args:
-         game_name: Name of the game
-
-     Returns:
-         Dictionary with game information
-     """
-     env = create_game_env(game_name)
-
-     info = {
-         "name": game_name,
-         "action_space": env.action_space,
-         "observation_space": env.observation_space,
-         "max_episode_steps": getattr(env, "_max_episode_steps", None),
-         "render_modes": env.metadata.get("render_modes", []),
-     }
-
-     # Add game-specific information
-     if game_name == "tictactoe":
-         info.update({
-             "description": "3x3 TicTacToe game with alternating turns",
-             "players": 2,
-             "zero_sum": True,
-             "perfect_information": True,
-         })
-     elif game_name == "kuhn_poker":
-         info.update({
-             "description": "Simplified poker with 3 cards (J, Q, K)",
-             "players": 2,
-             "zero_sum": True,
-             "perfect_information": False,
-         })
-
-     env.close()
-     return info
-
-
- def get_available_games() -> list:
-     """Get list of available game names."""
-     return list(GAMES_REGISTRY.keys())
-
-
- def is_game_over(env: gym.Env) -> bool:
-     """
-     Check if the game is over.
-
-     Args:
-         env: Game environment
-
-     Returns:
-         True if game is over, False otherwise
-     """
-     if hasattr(env, 'game_over'):
-         return env.game_over
-     return False
-
-
- def get_valid_actions(env: gym.Env) -> list:
-     """
-     Get valid actions for the current state.
-
-     Args:
-         env: Game environment
-
-     Returns:
-         List of valid actions
-     """
-     if hasattr(env, '_get_valid_actions'):
-         return env._get_valid_actions()
-     elif hasattr(env, 'get_valid_actions'):
-         return env.get_valid_actions()
-     else:
-         # Fallback: assume all actions are valid
-         return list(range(env.action_space.n))
-
-
- def get_action_mask(env: gym.Env) -> np.ndarray:
-     """
-     Get action mask for the current state.
-
-     Args:
-         env: Game environment
-
-     Returns:
-         Boolean mask where True indicates valid actions
-     """
-     if hasattr(env, 'get_action_mask'):
-         return env.get_action_mask()
-     else:
-         # Fallback: create mask from valid actions
-         valid_actions = get_valid_actions(env)
-         mask = np.zeros(env.action_space.n, dtype=bool)
-         for action in valid_actions:
-             mask[action] = True
-         return mask
-
-
- def play_random_game(game_name: str, render: bool = False, seed: int = None) -> Dict[str, Any]:
-     """
-     Play a random game to completion.
-
-     Args:
-         game_name: Name of the game to play
-         render: Whether to render the game
-         seed: Random seed for reproducibility
-
-     Returns:
-         Dictionary with game results
-     """
-     env = create_game_env(game_name, render_mode="human" if render else None)
-
-     if seed is not None:
-         env.reset(seed=seed)
-     else:
-         env.reset()
-
-     if render:
-         env.render()
-
-     total_reward = 0
-     step_count = 0
-     actions_taken = []
-
-     while not is_game_over(env):
-         valid_actions = get_valid_actions(env)
-         action = np.random.choice(valid_actions)
-
-         obs, reward, terminated, truncated, info = env.step(action)
-         actions_taken.append(action)
-         total_reward += reward
-         step_count += 1
-
-         if render:
-             print(f"Step {step_count}: Action {action}, Reward: {reward}")
-             env.render()
-
-         if terminated or truncated:
-             break
-
-     results = {
-         "game_name": game_name,
-         "total_reward": total_reward,
-         "step_count": step_count,
-         "actions_taken": actions_taken,
-         "winner": getattr(env, 'winner', None),
-         "final_info": info
-     }
-
-     env.close()
-     return results
-
-
- if __name__ == "__main__":
-     # Test the utilities
-     print("Available games:", get_available_games())
-
-     for game_name in get_available_games():
-         print(f"\n{game_name.upper()} Info:")
-         info = get_game_info(game_name)
-         for key, value in info.items():
-             print(f"  {key}: {value}")
-
-     # Play a random game
-     print("\nPlaying random TicTacToe game:")
-     result = play_random_game("tictactoe", render=True, seed=42)
 
src/games/kuhn_poker.py DELETED
@@ -1,314 +0,0 @@
- """
- Kuhn Poker Game Environment
-
- A simple Kuhn Poker implementation using Gymnasium for SPIRAL training.
- Kuhn Poker is a simplified poker variant with 3 cards (J, Q, K).
- """
-
- import gymnasium as gym
- import numpy as np
- from gymnasium import spaces
- from typing import Tuple, Dict, Any, Optional, List
- import random
-
-
- class KuhnPokerEnv(gym.Env):
-     """
-     Kuhn Poker environment for SPIRAL training.
-
-     Rules:
-     - 3 cards: Jack (0), Queen (1), King (2)
-     - Each player gets 1 card
-     - Each player antes 1 chip
-     - Player 1 acts first: Check or Bet
-     - Player 2 then acts: Check, Call, or Fold
-     - If both check, high card wins
-     - If one bets and other calls, high card wins
-     - If one bets and other folds, bettor wins
-
-     Action space: [Check/Call=0, Bet=1, Fold=2]
-     Observation space: [player_card, opponent_action, betting_round]
-     """
-
-     metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 1}
-
-     # Card values: Jack=0, Queen=1, King=2
-     JACK, QUEEN, KING = 0, 1, 2
-     CARDS = [JACK, QUEEN, KING]
-     CARD_NAMES = ["J", "Q", "K"]
-
-     # Actions
-     CHECK_CALL, BET, FOLD = 0, 1, 2
-     ACTION_NAMES = ["Check/Call", "Bet", "Fold"]
-
-     def __init__(self, render_mode: Optional[str] = None):
-         super().__init__()
-
-         # Observation: [player_card, opponent_last_action, betting_round, pot_size]
-         self.observation_space = spaces.Box(
-             low=0, high=10, shape=(4,), dtype=np.int8
-         )
-
-         # Actions: Check/Call, Bet, Fold
-         self.action_space = spaces.Discrete(3)
-
-         self.render_mode = render_mode
-         self.reset()
-
-     def reset(self, seed: Optional[int] = None, options: Optional[Dict] = None) -> Tuple[np.ndarray, Dict]:
-         """Reset the game to initial state."""
-         super().reset(seed=seed)
-
-         # Deal cards
-         cards = self.CARDS.copy()
-         random.shuffle(cards)
-         self.player1_card = cards[0]
-         self.player2_card = cards[1]
-
-         # Game state
-         self.current_player = 1  # Player 1 starts
-         self.pot = 2  # Each player antes 1
-         self.player1_bet = 1  # Ante
-         self.player2_bet = 1  # Ante
-         self.game_over = False
-         self.winner = None
-         self.betting_round = 0
-         self.actions_history = []
-
-         observation = self._get_observation()
-         info = self._get_info()
-
-         return observation, info
-
-     def step(self, action: int) -> Tuple[np.ndarray, float, bool, bool, Dict]:
-         """
-         Execute one step in the environment.
-
-         Args:
-             action: 0=Check/Call, 1=Bet, 2=Fold
-
-         Returns:
-             observation, reward, terminated, truncated, info
-         """
-         if self.game_over:
-             raise ValueError("Game is already over. Call reset() to start new game.")
-
-         # Record action
-         self.actions_history.append((self.current_player, action))
-
-         # Process action
-         if action == self.FOLD:
-             # Current player folds, opponent wins
-             self.game_over = True
-             self.winner = 2 if self.current_player == 1 else 1
-             reward = self._calculate_reward()
-
-         elif action == self.BET:
-             # Current player bets
-             if self.current_player == 1:
-                 self.player1_bet += 1
-                 self.pot += 1
-             else:
-                 self.player2_bet += 1
-                 self.pot += 1
-
-             # Check if this ends the betting round
-             if self.betting_round == 0:
-                 # First bet, opponent gets to act
-                 self.current_player = 2
-                 self.betting_round = 1
-                 reward = 0.0
-             else:
-                 # Second bet (raise), go to showdown
-                 self.game_over = True
-                 self.winner = self._determine_winner_by_cards()
-                 reward = self._calculate_reward()
-
-         else:  # CHECK_CALL
-             if self.betting_round == 0:
-                 # First action is check
-                 if self.current_player == 1:
-                     # Player 1 checks, player 2 acts
-                     self.current_player = 2
-                     self.betting_round = 1
-                     reward = 0.0
-                 else:
-                     # Player 2 checks after player 1 checked, showdown
-                     self.game_over = True
-                     self.winner = self._determine_winner_by_cards()
-                     reward = self._calculate_reward()
-             else:
-                 # This is a call
-                 if self.current_player == 2:
-                     # Player 2 calls player 1's bet
-                     self.player2_bet = self.player1_bet
-                     self.pot = self.player1_bet + self.player2_bet
-                     self.game_over = True
-                     self.winner = self._determine_winner_by_cards()
-                     reward = self._calculate_reward()
-                 else:
-                     # Player 1 calls player 2's bet
-                     self.player1_bet = self.player2_bet
-                     self.pot = self.player1_bet + self.player2_bet
-                     self.game_over = True
-                     self.winner = self._determine_winner_by_cards()
-                     reward = self._calculate_reward()
-
-         observation = self._get_observation()
-         info = self._get_info()
-
-         return observation, reward, self.game_over, False, info
-
-     def _get_observation(self) -> np.ndarray:
-         """Get current observation for the current player."""
-         # Get current player's card
-         player_card = self.player1_card if self.current_player == 1 else self.player2_card
-
-         # Get opponent's last action (if any)
-         opponent_last_action = -1
-         if self.actions_history:
-             for player, action in reversed(self.actions_history):
-                 if player != self.current_player:
-                     opponent_last_action = action
-                     break
-
-         # Observation: [player_card, opponent_last_action, betting_round, pot_size]
-         observation = np.array([
-             player_card,
-             opponent_last_action + 1,  # -1 becomes 0, 0 becomes 1, etc.
-             self.betting_round,
-             self.pot
-         ], dtype=np.int8)
-
-         return observation
-
-     def _get_info(self) -> Dict[str, Any]:
-         """Get additional info about the game state."""
-         return {
-             "current_player": self.current_player,
-             "game_over": self.game_over,
-             "winner": self.winner,
-             "player1_card": self.player1_card,
-             "player2_card": self.player2_card,
-             "pot": self.pot,
-             "betting_round": self.betting_round,
-             "actions_history": self.actions_history.copy(),
-             "valid_actions": self._get_valid_actions()
-         }
-
-     def _get_valid_actions(self) -> List[int]:
-         """Get list of valid actions."""
-         if self.game_over:
-             return []
-
-         # All actions are always valid in Kuhn Poker
-         return [self.CHECK_CALL, self.BET, self.FOLD]
-
-     def _determine_winner_by_cards(self) -> int:
-         """Determine winner by comparing cards."""
-         if self.player1_card > self.player2_card:
-             return 1
-         else:
-             return 2
-
-     def _calculate_reward(self) -> float:
-         """Calculate reward for the current player."""
-         if not self.game_over:
-             return 0.0
-
-         if self.winner == self.current_player:
-             # Won - get the pot minus what you put in
-             if self.current_player == 1:
-                 return float(self.pot - self.player1_bet)
-             else:
-                 return float(self.pot - self.player2_bet)
-         else:
-             # Lost - lose what you put in
-             if self.current_player == 1:
-                 return float(-self.player1_bet)
-             else:
-                 return float(-self.player2_bet)
-
-     def render(self) -> Optional[np.ndarray]:
-         """Render the game state."""
-         if self.render_mode == "human":
-             self._render_human()
-         elif self.render_mode == "rgb_array":
-             return self._render_rgb_array()
-
-     def _render_human(self):
-         """Print the game state to console."""
-         print("\n" + "=" * 40)
-         print("KUHN POKER")
-         print("=" * 40)
-         print(f"Player 1 Card: {self.CARD_NAMES[self.player1_card]}")
-         print(f"Player 2 Card: {self.CARD_NAMES[self.player2_card]}")
-         print(f"Pot: {self.pot}")
-         print(f"Current Player: {self.current_player}")
-         print(f"Betting Round: {self.betting_round}")
-
-         if self.actions_history:
-             print("Actions:")
-             for player, action in self.actions_history:
-                 print(f"  Player {player}: {self.ACTION_NAMES[action]}")
-
-         if self.game_over:
-             print(f"Game Over! Winner: Player {self.winner}")
-         print("=" * 40)
-
-     def _render_rgb_array(self) -> np.ndarray:
-         """Render as RGB array for visualization."""
-         # Simple RGB representation (placeholder)
-         rgb = np.zeros((100, 100, 3), dtype=np.uint8)
-
-         # Color based on current player's card
-         if self.current_player == 1:
-             card_value = self.player1_card
-         else:
-             card_value = self.player2_card
-
-         # Different colors for different cards
-         if card_value == self.JACK:
-             rgb[:, :] = [255, 0, 0]  # Red for Jack
-         elif card_value == self.QUEEN:
-             rgb[:, :] = [0, 255, 0]  # Green for Queen
-         else:  # King
-             rgb[:, :] = [0, 0, 255]  # Blue for King
-
-         return rgb
-
-     def get_action_mask(self) -> np.ndarray:
-         """Get mask of valid actions (1 for valid, 0 for invalid)."""
-         mask = np.zeros(3, dtype=np.int8)
-         for action in self._get_valid_actions():
-             mask[action] = 1
-         return mask
-
-
- def create_kuhn_poker_env() -> KuhnPokerEnv:
-     """Factory function to create a Kuhn Poker environment."""
-     return KuhnPokerEnv()
-
-
- if __name__ == "__main__":
-     # Test the environment
-     env = KuhnPokerEnv(render_mode="human")
-
-     # Play a simple game
-     obs, info = env.reset()
-     print("Initial state:")
-     env.render()
-
-     # Simulate some moves
-     while not env.game_over:
-         valid_actions = env._get_valid_actions()
-         action = random.choice(valid_actions)
-
-         obs, reward, terminated, truncated, info = env.step(action)
-         print(f"\nPlayer {env.current_player if not env.game_over else 'Previous'} action: {env.ACTION_NAMES[action]}")
-         print(f"Reward: {reward}")
-         env.render()
-
-         if terminated:
-             print(f"Game terminated! Final reward: {reward}")
-             break
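The betting rules listed in the deleted file's docstring reduce to a small payoff table. As a dependency-free sketch of that table (hypothetical `kuhn_payoff` helper, not part of the repo; cards 0=J, 1=Q, 2=K; returns player 1's chip gain):

```python
def kuhn_payoff(card1, card2, actions):
    """Player 1's chip gain for a completed Kuhn Poker hand (both ante 1)."""
    high = 1 if card1 > card2 else -1         # +1 if player 1 holds the higher card
    if actions == ("check", "check"):
        return high                           # showdown for the antes only
    if actions in (("bet", "call"), ("check", "bet", "call")):
        return 2 * high                       # showdown for antes plus one bet each
    if actions == ("bet", "fold"):
        return 1                              # player 2 folds, player 1 takes the ante
    if actions == ("check", "bet", "fold"):
        return -1                             # player 1 folds to player 2's bet
    raise ValueError("not a terminal action sequence")
```

This is the zero-sum structure SPIRAL relies on: player 2's gain is always the negative of player 1's.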
 
src/games/tictactoe.py DELETED
@@ -1,237 +0,0 @@
- """
- TicTacToe Game Environment
-
- A simple TicTacToe implementation using Gymnasium for SPIRAL training.
- """
-
- import gymnasium as gym
- import numpy as np
- from gymnasium import spaces
- from typing import Tuple, Dict, Any, Optional
-
-
- class TicTacToeEnv(gym.Env):
-     """
-     TicTacToe environment for SPIRAL training.
-
-     - 3x3 grid
-     - Players alternate turns (1 and -1)
-     - Action space: 9 positions (0-8)
-     - Observation space: 3x3 grid with values {-1, 0, 1}
-     - Reward: +1 for win, -1 for loss, 0 for draw/ongoing
-     """
-
-     metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 1}
-
-     def __init__(self, render_mode: Optional[str] = None):
-         super().__init__()
-
-         # 3x3 grid, each cell can be -1 (player 2), 0 (empty), or 1 (player 1)
-         self.observation_space = spaces.Box(
-             low=-1, high=1, shape=(3, 3), dtype=np.int8
-         )
-
-         # 9 possible actions (positions 0-8)
-         self.action_space = spaces.Discrete(9)
-
-         self.render_mode = render_mode
-         self.reset()
-
-     def reset(self, seed: Optional[int] = None, options: Optional[Dict] = None) -> Tuple[np.ndarray, Dict]:
-         """Reset the game to initial state."""
-         super().reset(seed=seed)
-
-         # Initialize empty board
-         self.board = np.zeros((3, 3), dtype=np.int8)
-         self.current_player = 1  # Player 1 starts
-         self.game_over = False
-         self.winner = None
-         self.move_count = 0
-
-         observation = self._get_observation()
-         info = self._get_info()
-
-         return observation, info
-
-     def step(self, action: int) -> Tuple[np.ndarray, float, bool, bool, Dict]:
-         """
-         Execute one step in the environment.
-
-         Args:
-             action: Position to place mark (0-8)
-
-         Returns:
-             observation, reward, terminated, truncated, info
-         """
-         if self.game_over:
-             raise ValueError("Game is already over. Call reset() to start new game.")
-
-         # Convert action to row, col
-         row, col = divmod(action, 3)
-
-         # Check if move is valid
-         if self.board[row, col] != 0:
-             # Invalid move - penalize and end game
-             reward = -1.0
-             terminated = True
-             self.game_over = True
-             info = self._get_info()
-             info["invalid_move"] = True
-             return self._get_observation(), reward, terminated, False, info
-
-         # Make the move
-         self.board[row, col] = self.current_player
-         self.move_count += 1
-
-         # Check for win
-         winner = self._check_winner()
-         if winner is not None:
-             self.game_over = True
-             self.winner = winner
-             reward = 1.0 if winner == self.current_player else -1.0
-             terminated = True
-         elif self.move_count >= 9:
-             # Draw
-             self.game_over = True
-             reward = 0.0
-             terminated = True
-         else:
-             # Game continues
-             reward = 0.0
-             terminated = False
-             self.current_player *= -1  # Switch player
-
-         observation = self._get_observation()
-         info = self._get_info()
-
-         return observation, reward, terminated, False, info
-
-     def _get_observation(self) -> np.ndarray:
-         """Get current board state."""
-         return self.board.copy()
-
-     def _get_info(self) -> Dict[str, Any]:
-         """Get additional info about the game state."""
-         return {
-             "current_player": self.current_player,
-             "game_over": self.game_over,
-             "winner": self.winner,
-             "move_count": self.move_count,
-             "valid_actions": self._get_valid_actions()
-         }
-
-     def _get_valid_actions(self) -> list:
-         """Get list of valid actions (empty positions)."""
-         valid_actions = []
-         for i in range(9):
-             row, col = divmod(i, 3)
-             if self.board[row, col] == 0:
-                 valid_actions.append(i)
-         return valid_actions
-
-     def _check_winner(self) -> Optional[int]:
-         """
-         Check if there's a winner.
-
-         Returns:
-             1 if player 1 wins, -1 if player 2 wins, None if no winner
-         """
-         # Check rows
-         for row in range(3):
-             if abs(self.board[row, :].sum()) == 3:
-                 return self.board[row, 0]
-
-         # Check columns
-         for col in range(3):
-             if abs(self.board[:, col].sum()) == 3:
-                 return self.board[0, col]
-
-         # Check diagonals
-         if abs(self.board.diagonal().sum()) == 3:
-             return self.board[0, 0]
-
-         if abs(np.fliplr(self.board).diagonal().sum()) == 3:
-             return self.board[0, 2]
-
-         return None
-
-     def render(self) -> Optional[np.ndarray]:
-         """Render the game state."""
-         if self.render_mode == "human":
-             self._render_human()
-         elif self.render_mode == "rgb_array":
-             return self._render_rgb_array()
-
-     def _render_human(self):
-         """Print the board to console."""
-         print("\n" + "=" * 13)
-         for row in range(3):
-             print("|", end="")
-             for col in range(3):
-                 cell = self.board[row, col]
-                 if cell == 1:
-                     print(" X ", end="|")
-                 elif cell == -1:
-                     print(" O ", end="|")
-                 else:
-                     print(f" {row * 3 + col} ", end="|")
-             print()
-         print("=" * 13)
-
-         if self.game_over:
-             if self.winner is not None:
-                 winner_symbol = "X" if self.winner == 1 else "O"
-                 print(f"Game Over! Winner: {winner_symbol}")
-             else:
-                 print("Game Over! It's a draw!")
-
-     def _render_rgb_array(self) -> np.ndarray:
-         """Render as RGB array for visualization."""
-         # Simple RGB representation
-         rgb = np.zeros((3, 3, 3), dtype=np.uint8)
-
-         # Player 1 (X) = Red, Player 2 (O) = Blue, Empty = White
-         for row in range(3):
-             for col in range(3):
-                 if self.board[row, col] == 1:
-                     rgb[row, col] = [255, 0, 0]  # Red
-                 elif self.board[row, col] == -1:
-                     rgb[row, col] = [0, 0, 255]  # Blue
-                 else:
-                     rgb[row, col] = [255, 255, 255]  # White
-
-         return rgb
-
-     def get_action_mask(self) -> np.ndarray:
-         """Get mask of valid actions (1 for valid, 0 for invalid)."""
-         mask = np.zeros(9, dtype=np.int8)
-         for action in self._get_valid_actions():
-             mask[action] = 1
-         return mask
-
-
- def create_tictactoe_env() -> TicTacToeEnv:
-     """Factory function to create a TicTacToe environment."""
-     return TicTacToeEnv()
-
-
- if __name__ == "__main__":
-     # Test the environment
-     env = TicTacToeEnv(render_mode="human")
-
-     # Play a simple game
-     obs, info = env.reset()
-     print("Initial state:")
-     env.render()
-
-     # Make some moves
-     moves = [0, 4, 1, 3, 2]  # X wins
-     for move in moves:
-         if not env.game_over:
-             obs, reward, terminated, truncated, info = env.step(move)
-             print(f"\nMove: {move}, Reward: {reward}")
-             env.render()
-
-             if terminated:
-                 print(f"Game terminated! Final reward: {reward}")
-                 break
 
src/models/__init__.py DELETED
@@ -1,13 +0,0 @@
-"""
-SPIRAL model implementations.
-
-This module contains the core SPIRAL model architecture and
-role-conditioned advantage estimation (RAE) components.
-"""
-
-from .spiral_model import SpiralModel
-from .rae import RoleConditionedAdvantageEstimator
-from .policy_network import PolicyNetwork
-from .value_network import ValueNetwork
-
-__all__ = ["SpiralModel", "RoleConditionedAdvantageEstimator", "PolicyNetwork", "ValueNetwork"]
 
src/reasoning/__init__.py DELETED
@@ -1,13 +0,0 @@
-"""
-Reasoning trace generation and analysis.
-
-This module handles the generation of step-by-step reasoning traces
-during gameplay and transfer to non-game tasks.
-"""
-
-from .trace_generator import TraceGenerator
-from .chain_of_thought import ChainOfThought
-from .transfer_evaluator import TransferEvaluator
-from .reasoning_utils import ReasoningUtils
-
-__all__ = ["TraceGenerator", "ChainOfThought", "TransferEvaluator", "ReasoningUtils"]
 
src/training/__init__.py DELETED
@@ -1,13 +0,0 @@
-"""
-Training components for SPIRAL.
-
-This module implements the self-play training logic using PPO
-with role-conditioned advantage estimation.
-"""
-
-from .self_play_trainer import SelfPlayTrainer
-from .ppo_trainer import PPOTrainer
-from .opponent_manager import OpponentManager
-from .training_utils import TrainingUtils
-
-__all__ = ["SelfPlayTrainer", "PPOTrainer", "OpponentManager", "TrainingUtils"]
 
src/training/train_spiral.py DELETED
@@ -1,58 +0,0 @@
-import os
-import torch
-import gymnasium as gym
-import numpy as np
-from stable_baselines3 import PPO
-from stable_baselines3.common.vec_env import DummyVecEnv
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import yaml
-
-# Load config
-with open('../../config.yaml', 'r') as f:
-    config = yaml.safe_load(f)
-
-model_name = config['model']['name']
-max_length = config['model']['max_length']
-
-# Load base LLM (quantized)
-model = AutoModelForCausalLM.from_pretrained(model_name, **config['model']['quantization'])
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-
-# Custom Policy with RAE (simplified)
-class SpiralPolicy(torch.nn.Module):
-    def __init__(self, observation_space, action_space):
-        super().__init__()
-        self.role_embed = torch.nn.Embedding(2, 64)  # 0: player, 1: opponent
-        # Add more layers as needed
-
-    def forward(self, obs, role):
-        # Condition on role
-        role_emb = self.role_embed(role)
-        # Compute policy/value (placeholder)
-        return policy, value
-
-def train_spiral(game='tictactoe', episodes=1000):
-    if game == 'tictactoe':
-        from src.games.tictactoe import TicTacToeEnv
-        env_fn = lambda: TicTacToeEnv()
-    else:
-        raise ValueError('Game not supported yet')
-
-    env = DummyVecEnv([env_fn])
-
-    # PPO with custom policy
-    model = PPO('MlpPolicy', env, verbose=1, learning_rate=0.0003)
-
-    # Self-play loop (simplified: train against current self)
-    for ep in range(episodes):
-        model.learn(total_timesteps=1000)  # Train batch
-        # Simulate self-play by cloning or saving opponent policy
-        print(f'Episode {ep}: Trained')
-
-    # Save model
-    os.makedirs('../../models', exist_ok=True)
-    model.save('../../models/spiral_tictactoe.zip')
-    print('Model saved!')
-
-if __name__ == '__main__':
-    train_spiral()
 
tests/test_basic.py DELETED
@@ -1,130 +0,0 @@
-"""
-Basic tests for SPIRAL Interactive Reasoning Game Simulator.
-
-This module contains fundamental tests to verify the core functionality
-of the SPIRAL system components.
-"""
-
-import pytest
-import os
-import sys
-import yaml
-
-# Add the src directory to the path for imports
-sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'src'))
-sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'app'))
-
-from app import SpiralApp
-
-class TestSpiralApp:
-    """Test cases for the main SPIRAL application."""
-
-    def test_app_initialization(self):
-        """Test that the app initializes correctly."""
-        app = SpiralApp()
-        assert app is not None
-        assert hasattr(app, 'config')
-        assert hasattr(app, 'play_game')
-        assert hasattr(app, 'test_reasoning')
-
-    def test_config_loading(self):
-        """Test configuration loading."""
-        app = SpiralApp()
-        assert 'interface' in app.config
-        assert 'games' in app.config
-        assert app.config['interface']['title'] is not None
-
-    def test_play_game_basic(self):
-        """Test basic game play functionality."""
-        app = SpiralApp()
-
-        # Test with valid input
-        state, response, trace = app.play_game("kuhn_poker", "bet", "")
-        assert state is not None
-        assert response is not None
-        assert trace is not None
-        assert "bet" in state
-
-        # Test with empty input
-        state, response, trace = app.play_game("kuhn_poker", "", "")
-        assert "Please enter a move!" in response
-
-    def test_reasoning_basic(self):
-        """Test basic reasoning functionality."""
-        app = SpiralApp()
-
-        # Test with valid input
-        response, trace = app.test_reasoning("What is 2+2?", "math")
-        assert response is not None
-        assert trace is not None
-        assert "2+2" in response
-
-        # Test with empty input
-        response, trace = app.test_reasoning("", "math")
-        assert "Please enter a reasoning prompt!" in response
-
-    def test_interface_creation(self):
-        """Test that the Gradio interface can be created."""
-        app = SpiralApp()
-        demo = app.create_interface()
-        assert demo is not None
-
-class TestConfiguration:
-    """Test cases for configuration management."""
-
-    def test_config_file_structure(self):
-        """Test that config.yaml has the expected structure."""
-        config_path = os.path.join(os.path.dirname(__file__), '..', 'config.yaml')
-
-        if os.path.exists(config_path):
-            with open(config_path, 'r') as f:
-                config = yaml.safe_load(f)
-
-            # Check required sections
-            assert 'model' in config
-            assert 'games' in config
-            assert 'training' in config
-            assert 'reasoning' in config
-            assert 'interface' in config
-
-            # Check model configuration
-            assert 'name' in config['model']
-            assert 'max_length' in config['model']
-
-            # Check games configuration
-            assert 'kuhn_poker' in config['games']
-            assert 'tictactoe' in config['games']
-
-class TestProjectStructure:
-    """Test cases for project structure and imports."""
-
-    def test_src_directory_structure(self):
-        """Test that the src directory has the expected structure."""
-        src_path = os.path.join(os.path.dirname(__file__), '..', 'src')
-
-        # Check that required directories exist
-        assert os.path.exists(os.path.join(src_path, 'games'))
-        assert os.path.exists(os.path.join(src_path, 'models'))
-        assert os.path.exists(os.path.join(src_path, 'training'))
-        assert os.path.exists(os.path.join(src_path, 'reasoning'))
-
-        # Check that __init__.py files exist
-        assert os.path.exists(os.path.join(src_path, '__init__.py'))
-        assert os.path.exists(os.path.join(src_path, 'games', '__init__.py'))
-        assert os.path.exists(os.path.join(src_path, 'models', '__init__.py'))
-        assert os.path.exists(os.path.join(src_path, 'training', '__init__.py'))
-        assert os.path.exists(os.path.join(src_path, 'reasoning', '__init__.py'))
-
-    def test_required_files_exist(self):
-        """Test that required project files exist."""
-        project_root = os.path.join(os.path.dirname(__file__), '..')
-
-        # Check essential files
-        assert os.path.exists(os.path.join(project_root, 'requirements.txt'))
-        assert os.path.exists(os.path.join(project_root, 'README.md'))
-        assert os.path.exists(os.path.join(project_root, 'config.yaml'))
-        assert os.path.exists(os.path.join(project_root, '.gitignore'))
-        assert os.path.exists(os.path.join(project_root, 'app', 'app.py'))
-
-if __name__ == "__main__":
-    pytest.main([__file__])
 
tests/test_games.py DELETED
@@ -1,78 +0,0 @@
-import pytest
-import numpy as np
-import sys
-import os
-
-# Add src to path to allow importing TicTacToeEnv
-sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../src')))
-
-from games.tictactoe import TicTacToeEnv
-
-@pytest.fixture
-def env():
-    """Fixture to create a fresh TicTacToeEnv for each test."""
-    return TicTacToeEnv()
-
-def test_initial_state(env):
-    """Test the initial state of the board."""
-    assert np.all(env.board == np.zeros((3, 3)))
-    assert env.current_player == 1
-    assert not env.game_over
-
-def test_player_move(env):
-    """Test a valid player move."""
-    env.step(0)
-    assert env.board[0, 0] == 1
-    assert env.current_player == -1
-    assert not env.game_over
-
-def test_invalid_move(env):
-    """Test making an invalid move on an occupied cell."""
-    env.step(0)
-    with pytest.raises(ValueError):
-        env.step(0)
-
-def test_win_condition_row(env):
-    """Test a win condition in a row."""
-    env.board = np.array([[1, 1, 1], [0, -1, 0], [-1, 0, 0]])
-    assert env._check_winner(1)
-    assert not env._check_winner(-1)
-
-def test_win_condition_col(env):
-    """Test a win condition in a column."""
-    env.board = np.array([[-1, 1, 0], [-1, 1, 0], [-1, 0, 0]])
-    assert not env._check_winner(1)
-    assert env._check_winner(-1)
-
-def test_win_condition_diag(env):
-    """Test a win condition on a diagonal."""
-    env.board = np.array([[1, 0, -1], [0, 1, -1], [0, 0, 1]])
-    assert env._check_winner(1)
-
-def test_draw_condition(env):
-    """Test a draw condition."""
-    env.board = np.array([[1, -1, 1], [1, -1, 1], [-1, 1, -1]])
-    assert env._is_draw()
-    assert not env._check_winner(1)
-    assert not env._check_winner(-1)
-
-def test_game_over_on_win(env):
-    """Test that the game_over flag is set on a win."""
-    env.step(0)  # P1
-    env.step(3)  # P2
-    env.step(1)  # P1
-    env.step(4)  # P2
-    _, _, terminated, _, _ = env.step(2)  # P1 wins
-    assert terminated
-    assert env.game_over
-    assert env.winner == 1
-
-def test_reset(env):
-    """Test if the environment resets correctly."""
-    env.step(0)
-    env.step(1)
-    env.reset()
-    assert np.all(env.board == np.zeros((3, 3)))
-    assert env.current_player == 1
-    assert not env.game_over
-    assert env.winner is None