lemms committed on
Commit 85c2681 · verified · 1 Parent(s): a67c931

Upload app.py with huggingface_hub

Files changed (1)
  1. app.py +426 -1361
app.py CHANGED
@@ -1,1361 +1,426 @@
1
- #!/usr/bin/env python3
2
- """
3
- OpenLLM Real Models App - Ultimate Working Version with Correct lm_head Bias Handling
4
-
5
- This is the FINAL WORKING VERSION of the OpenLLM Real Models inference application that has been
6
- extensively debugged and optimized to correctly load and run the actual trained OpenLLM models
7
- from Hugging Face Hub.
8
-
9
- CRITICAL ARCHITECTURE MATCHING:
10
- - The GPT model architecture EXACTLY matches the saved state_dict from the trained models
11
- - All layer naming conventions use the 'transformer.' prefix (wte, wpe, h, ln_f)
12
- - Custom transformer blocks (Block, CausalSelfAttention, MLP) replace generic nn.TransformerEncoderLayer
13
- - Attention bias is correctly handled as causal attention masks (register_buffer) not learnable parameters
14
- - Language model head (lm_head) uses bias=False to match the saved model's architecture
15
- - All attribute naming conflicts have been resolved (use_bias vs bias)
16
-
17
- MODEL LOADING PROCESS:
18
- 1. Download model files from Hugging Face Hub using snapshot_download
19
- 2. Parse config.json to extract model configuration parameters
20
- 3. Create GPTConfig object with exact parameter matching
21
- 4. Initialize GPT model with custom architecture
22
- 5. Load state_dict from best_model.pt (handles model_state_dict wrapper)
23
- 6. Load SentencePiece tokenizer from tokenizer.model
24
- 7. Set model to evaluation mode for inference
25
-
26
- TEXT GENERATION FEATURES:
27
- - Real-time text generation using actual trained model weights
28
- - Configurable generation parameters (temperature, top_k, top_p, max_length)
29
- - Proper tokenization and detokenization using SentencePiece
30
- - Causal language modeling with attention masking
31
- - Support for all 5 model variants (4k, 6k, 7k, 8k, 9k training steps)
32
-
33
- TECHNICAL IMPLEMENTATION DETAILS:
34
- - PyTorch-based transformer architecture with custom attention implementation
35
- - Gradio web interface for user-friendly model interaction
36
- - Comprehensive error handling and logging throughout the pipeline
37
- - Memory-efficient model loading with CPU-only inference
38
- - Real-time model switching between different training checkpoints
39
-
40
- AUTHOR: Louis Chua Bean Chong
41
- PROJECT: OpenLLM - Open Source Large Language Model Framework
42
- LICENSE: GPLv3 - Open Source First Philosophy
43
- """
44
-
45
- import gradio as gr
46
- import torch
47
- import torch.nn as nn
48
- import torch.nn.functional as F
49
- import json
50
- import logging
51
- import sentencepiece as spm
52
- import math
53
- from pathlib import Path
54
- from typing import Dict, Any, Optional
55
- from huggingface_hub import snapshot_download
56
-
57
- # Set up comprehensive logging for debugging and monitoring
58
- logging.basicConfig(level=logging.INFO)
59
- logger = logging.getLogger(__name__)
60
-
61
- class GPTConfig:
62
- """
63
- GPT Model Configuration Class - Handles All Model Architecture Parameters
64
-
65
- This class defines the complete configuration for the GPT-style transformer model,
66
- including all architectural parameters that determine the model's size, capacity,
67
- and behavior. It accepts additional kwargs to handle any extra configuration
68
- fields that might be present in the saved model's config.json file.
69
-
70
- CRITICAL PARAMETERS:
71
- - vocab_size: Size of the vocabulary (32,000 for OpenLLM models)
72
- - n_layer: Number of transformer layers (6 for small models)
73
- - n_head: Number of attention heads (8 for small models)
74
- - n_embd: Embedding dimension (512 for small models)
75
- - block_size: Maximum sequence length (1024 tokens)
76
- - dropout: Dropout rate for regularization (0.1)
77
- - bias: Whether to use bias terms in linear layers (True)
78
-
79
- ARCHITECTURE NOTES:
80
- - Small model configuration: 6 layers, 8 heads, 512 dims = 35.8M parameters
81
- - This matches the exact architecture used during training
82
- - All parameters are carefully tuned for the SQuAD dataset training
83
- """
84
- def __init__(self, vocab_size=32000, n_layer=6, n_head=8, n_embd=512,
85
- block_size=1024, dropout=0.1, bias=True, **kwargs):
86
- # Accept any additional kwargs to handle extra config fields from saved models
87
- # This is crucial for loading models that may have additional metadata
88
- self.vocab_size = vocab_size
89
- self.n_layer = n_layer
90
- self.n_head = n_head
91
- self.n_embd = n_embd
92
- self.block_size = block_size
93
- self.dropout = dropout
94
- self.bias = bias
95
-
96
- class GPT(nn.Module):
97
- """
98
- GPT-Style Transformer Model - EXACT Architecture Matching the Saved Model
99
-
100
- This is the core transformer model that EXACTLY matches the architecture of the
101
- trained OpenLLM models. Every layer, every parameter, and every naming convention
102
- has been carefully designed to match the saved state_dict from the training process.
103
-
104
- ARCHITECTURE COMPONENTS:
105
- - transformer.wte: Word token embeddings (vocab_size -> n_embd)
106
- - transformer.wpe: Position embeddings (block_size -> n_embd)
107
- - transformer.drop: Dropout layer for regularization
108
- - transformer.h: List of transformer blocks (n_layer count)
109
- - transformer.ln_f: Final layer normalization
110
- - lm_head: Language model head (n_embd -> vocab_size, NO bias)
111
-
112
- CRITICAL DESIGN DECISIONS:
113
- - Uses nn.ModuleDict for transformer components to match 'transformer.' prefix
114
- - Custom Block, CausalSelfAttention, and MLP classes for exact architecture
115
- - lm_head.bias = False to match saved model (no bias term)
116
- - Proper weight initialization following GPT-style conventions
117
- - Causal attention masking for autoregressive generation
118
-
119
- FORWARD PASS:
120
- - Combines token and position embeddings
121
- - Processes through transformer blocks with residual connections
122
- - Applies final layer normalization
123
- - Projects to vocabulary space for next-token prediction
124
-
125
- GENERATION:
126
- - Autoregressive text generation with temperature, top-k, and top-p sampling
127
- - Causal attention ensures tokens only attend to previous tokens
128
- - Configurable generation parameters for different text styles
129
- """
130
- def __init__(self, config):
131
- super().__init__()
132
- # Validate critical configuration parameters
133
- assert config.vocab_size is not None, "vocab_size must be specified"
134
- assert config.block_size is not None, "block_size must be specified"
135
- self.config = config
136
-
137
- # Create the transformer module with the EXACT naming convention from saved model
138
- # This nn.ModuleDict structure is crucial for matching the 'transformer.' prefix
139
- # in the saved state_dict keys (transformer.wte.weight, transformer.wpe.weight, etc.)
140
- self.transformer = nn.ModuleDict(dict(
141
- wte = nn.Embedding(config.vocab_size, config.n_embd), # Word token embeddings
142
- wpe = nn.Embedding(config.block_size, config.n_embd), # Position embeddings
143
- drop = nn.Dropout(config.dropout), # Dropout for regularization
144
- h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]), # Transformer blocks
145
- ln_f = nn.LayerNorm(config.n_embd), # Final layer normalization
146
- ))
147
-
148
- # Language model head - CRITICAL: NO bias to match saved model architecture
149
- # The saved models were trained without bias in the language model head
150
- # This is a common practice in transformer language models for efficiency
151
- self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
152
-
153
- # Initialize weights using GPT-style initialization
154
- # This ensures proper weight scaling and prevents gradient issues
155
- self.apply(self._init_weights)
156
- for pn, p in self.named_parameters():
157
- if pn.endswith('c_proj.weight'):
158
- # Special initialization for projection layers in transformer blocks
159
- torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))
160
-
161
- def _init_weights(self, module):
162
- """
163
- GPT-Style Weight Initialization for All Model Components
164
-
165
- This function applies the standard GPT weight initialization strategy:
166
- - Linear layers: Normal distribution with mean=0, std=0.02
167
- - Embeddings: Normal distribution with mean=0, std=0.02
168
- - Bias terms: Zero initialization (when present)
169
-
170
- This initialization scheme has been proven effective for transformer models
171
- and helps with training stability and convergence.
172
- """
173
- if isinstance(module, nn.Linear):
174
- torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
175
- if module.bias is not None:
176
- torch.nn.init.zeros_(module.bias)
177
- elif isinstance(module, nn.Embedding):
178
- torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
179
-
180
- def forward(self, idx, targets=None):
181
- """
182
- Forward Pass Through the Complete Transformer Model
183
-
184
- This is the main inference function that processes input tokens through
185
- the entire transformer architecture to produce logits for next-token prediction.
186
-
187
- ARGUMENTS:
188
- - idx: Input token indices (batch_size, sequence_length)
189
- - targets: Target token indices for training (optional, for loss computation)
190
-
191
- PROCESSING STEPS:
192
- 1. Extract sequence length and validate against block_size
193
- 2. Create position indices for positional encoding
194
- 3. Look up token and position embeddings
195
- 4. Combine embeddings and apply dropout
196
- 5. Process through all transformer blocks
197
- 6. Apply final layer normalization
198
- 7. Project to vocabulary space via language model head
199
-
200
- RETURNS:
201
- - logits: Predicted token probabilities (batch_size, seq_len, vocab_size)
202
- - loss: Cross-entropy loss (only if targets provided)
203
-
204
- NOTE: During inference (targets=None), only the last token's logits are returned
205
- for efficient autoregressive generation.
206
- """
207
- device = idx.device
208
- b, t = idx.size()
209
- # Validate sequence length against model's maximum block size
210
- assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
211
-
212
- # Create position indices for positional encoding
213
- # This enables the model to understand token positions in the sequence
214
- pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0)
215
-
216
- # Look up embeddings for tokens and positions
217
- tok_emb = self.transformer.wte(idx) # Token embeddings
218
- pos_emb = self.transformer.wpe(pos) # Position embeddings
219
-
220
- # Combine embeddings and apply dropout for regularization
221
- x = self.transformer.drop(tok_emb + pos_emb)
222
-
223
- # Process through all transformer blocks with residual connections
224
- for block in self.transformer.h:
225
- x = block(x)
226
-
227
- # Apply final layer normalization
228
- x = self.transformer.ln_f(x)
229
-
230
- # Project to vocabulary space for next-token prediction
231
- if targets is not None:
232
- # Training mode: compute loss for all positions
233
- logits = self.lm_head(x)
234
- loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
235
- else:
236
- # Inference mode: only compute logits for the last token (efficient generation)
237
- logits = self.lm_head(x[:, [-1], :])
238
- loss = None
239
-
240
- return logits, loss
241
-
242
- def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None, top_p=None, do_sample=True):
243
- """
244
- Autoregressive Text Generation with Advanced Sampling Strategies
245
-
246
- This function generates text by repeatedly predicting the next token
247
- using the trained model, with configurable sampling parameters for
248
- controlling the creativity and coherence of the generated text.
249
-
250
- GENERATION PROCESS:
251
- 1. For each new token to generate:
252
- a. Forward pass through model to get logits for next token
253
- b. Apply temperature scaling to control randomness
254
- c. Apply top-k filtering to limit vocabulary choices
255
- d. Apply top-p (nucleus) sampling for dynamic vocabulary selection
256
- e. Sample next token from filtered probability distribution
257
- f. Append to sequence and repeat
258
-
259
- SAMPLING PARAMETERS:
260
- - temperature: Controls randomness (higher = more random, lower = more focused)
261
- - top_k: Limits vocabulary to k highest probability tokens
262
- - top_p: Nucleus sampling - limits to tokens with cumulative probability <= p
263
- - do_sample: Whether to sample (True) or use greedy decoding (False)
264
-
265
- ATTENTION HANDLING:
266
- - Uses causal attention masking to ensure tokens only attend to previous tokens
267
- - Automatically handles sequence length limits via block_size
268
- - Efficient autoregressive generation with minimal memory usage
269
-
270
- RETURNS:
271
- - Complete token sequence including input and generated tokens
272
- """
273
- for _ in range(max_new_tokens):
274
- # Ensure sequence doesn't exceed model's block size
275
- idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
276
-
277
- # Forward pass to get logits for next token
278
- logits, _ = self(idx_cond)
279
- logits = logits[:, -1, :] / temperature # Apply temperature scaling
280
-
281
- # Top-k filtering: keep only the k highest probability tokens
282
- if top_k is not None:
283
- v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
284
- logits[logits < v[:, [-1]]] = -float('Inf')
285
-
286
- # Top-p (nucleus) sampling: keep tokens with cumulative probability <= top_p
287
- if top_p is not None:
288
- sorted_logits, sorted_indices = torch.sort(logits, descending=True)
289
- cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
290
- sorted_indices_to_remove = cumulative_probs > top_p
291
- sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
292
- sorted_indices_to_remove[..., 0] = 0
293
- indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
294
- logits[indices_to_remove] = -float('Inf')
295
-
296
- # Convert logits to probabilities and sample next token
297
- probs = F.softmax(logits, dim=-1)
298
- if do_sample:
299
- # Stochastic sampling for creative text generation
300
- idx_next = torch.multinomial(probs, num_samples=1)
301
- else:
302
- # Greedy decoding for deterministic generation
303
- _, idx_next = torch.topk(probs, k=1, dim=-1)
304
-
305
- # Append new token to sequence
306
- idx = torch.cat((idx, idx_next), dim=1)
307
-
308
- return idx
309
-
310
- class Block(nn.Module):
311
- """
312
- Transformer Block - Core Building Block of the GPT Architecture
313
-
314
- Each transformer block implements the standard transformer architecture with:
315
- - Multi-head self-attention mechanism for capturing token relationships
316
- - Feed-forward neural network for processing attention outputs
317
- - Layer normalization for training stability
318
- - Residual connections for gradient flow
319
-
320
- ARCHITECTURE:
321
- - ln_1: Pre-attention layer normalization
322
- - attn: Multi-head causal self-attention
323
- - ln_2: Pre-feedforward layer normalization
324
- - mlp: Multi-layer perceptron (feed-forward network)
325
-
326
- RESIDUAL CONNECTIONS:
327
- - x = x + attn(ln_1(x)) # Residual connection around attention
328
- - x = x + mlp(ln_2(x)) # Residual connection around feed-forward
329
-
330
- DESIGN RATIONALE:
331
- - Layer normalization is applied BEFORE each sublayer (pre-norm)
332
- - This improves training stability and allows deeper networks
333
- - Residual connections help with gradient flow during backpropagation
334
- - The combination enables effective training of very deep transformer models
335
- """
336
- def __init__(self, config):
337
- super().__init__()
338
- self.ln_1 = nn.LayerNorm(config.n_embd) # Pre-attention normalization
339
- self.attn = CausalSelfAttention(config) # Multi-head causal attention
340
- self.ln_2 = nn.LayerNorm(config.n_embd) # Pre-feedforward normalization
341
- self.mlp = MLP(config) # Feed-forward network
342
-
343
- def forward(self, x):
344
- """
345
- Forward Pass Through a Single Transformer Block
346
-
347
- This implements the standard transformer block computation with
348
- pre-norm layer normalization and residual connections.
349
-
350
- PROCESSING STEPS:
351
- 1. Apply layer normalization to input
352
- 2. Process through multi-head self-attention
353
- 3. Add residual connection (x + attention_output)
354
- 4. Apply layer normalization to result
355
- 5. Process through feed-forward network
356
- 6. Add residual connection (x + feedforward_output)
357
-
358
- ARGUMENTS:
359
- - x: Input tensor of shape (batch_size, sequence_length, embedding_dim)
360
-
361
- RETURNS:
362
- - Output tensor of same shape as input
363
- """
364
- # First sublayer: self-attention with residual connection
365
- x = x + self.attn(self.ln_1(x))
366
- # Second sublayer: feed-forward with residual connection
367
- x = x + self.mlp(self.ln_2(x))
368
- return x
369
-
370
- class CausalSelfAttention(nn.Module):
371
- """
372
- Multi-Head Causal Self-Attention - ULTIMATE WORKING VERSION
373
-
374
- This is the FINAL WORKING VERSION of the attention mechanism that correctly
375
- handles the causal attention bias as a buffer (not a learnable parameter).
376
- This was a critical fix that resolved the state_dict loading issues.
377
-
378
- ATTENTION MECHANISM:
379
- - Multi-head attention allows the model to attend to different parts of the sequence
380
- - Causal masking ensures tokens can only attend to previous tokens (autoregressive)
381
- - Query, Key, Value projections from the same input sequence
382
- - Scaled dot-product attention with optional dropout
383
-
384
- CRITICAL FIXES IMPLEMENTED:
385
- - Attention bias is correctly handled as a causal mask buffer (register_buffer)
386
- - Attribute naming conflict resolved (use_bias vs bias)
387
- - Proper attention mask application in forward pass
388
- - Exact matching with saved model's attention architecture
389
-
390
- ARCHITECTURE COMPONENTS:
391
- - c_attn: Combined QKV projection (n_embd -> 3*n_embd)
392
- - c_proj: Output projection (n_embd -> n_embd)
393
- - attn_dropout: Dropout for attention weights
394
- - resid_dropout: Dropout for output projection
395
- - bias: Causal attention mask (registered as buffer, not parameter)
396
-
397
- ATTENTION COMPUTATION:
398
- 1. Project input to Q, K, V vectors
399
- 2. Reshape for multi-head attention
400
- 3. Apply scaled dot-product attention with causal masking
401
- 4. Reshape back to original dimensions
402
- 5. Apply output projection with dropout
403
- """
404
- def __init__(self, config):
405
- super().__init__()
406
- # Validate that embedding dimension is divisible by number of heads
407
- assert config.n_embd % config.n_head == 0, "Embedding dimension must be divisible by number of heads"
408
-
409
- # Attention projections
410
- self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias) # QKV projection
411
- self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias) # Output projection
412
-
413
- # Dropout layers for regularization
414
- self.attn_dropout = nn.Dropout(config.dropout) # Attention weight dropout
415
- self.resid_dropout = nn.Dropout(config.dropout) # Output dropout
416
-
417
- # Store configuration parameters
418
- self.n_head = config.n_head # Number of attention heads
419
- self.n_embd = config.n_embd # Embedding dimension
420
- self.dropout = config.dropout # Dropout rate
421
- self.use_bias = config.bias # Use different name for the boolean flag to avoid conflicts
422
-
423
- # CRITICAL FIX: REGISTER THE ATTENTION BIAS as a buffer (not parameter)
424
- # This is actually an attention mask, not a learnable bias
425
- # The saved model stores this as 'bias' in the state_dict
426
- if config.bias:
427
- # Create a causal attention mask buffer
428
- # This is a lower triangular matrix that prevents tokens from attending to future tokens
429
- mask = torch.tril(torch.ones(config.block_size, config.block_size))
430
- mask = mask.view(1, 1, config.block_size, config.block_size)
431
- self.register_buffer('bias', mask) # This matches the saved model's 'bias' key
432
- else:
433
- self.register_buffer('bias', None)
434
-
435
- def forward(self, x):
436
- """
437
- Forward Pass Through Multi-Head Causal Self-Attention
438
-
439
- This function implements the complete attention mechanism including:
440
- - Query, Key, Value computation from input
441
- - Multi-head attention with causal masking
442
- - Output projection and dropout
443
-
444
- ATTENTION STEPS:
445
- 1. Project input to Q, K, V vectors (combined projection for efficiency)
446
- 2. Reshape for multi-head attention (separate heads)
447
- 3. Apply scaled dot-product attention with causal masking
448
- 4. Reshape back to original dimensions
449
- 5. Apply output projection with dropout
450
-
451
- ARGUMENTS:
452
- - x: Input tensor of shape (batch_size, sequence_length, embedding_dim)
453
-
454
- RETURNS:
455
- - Output tensor of same shape as input
456
- """
457
- B, T, C = x.size() # Batch size, sequence length, embedding dimension
458
-
459
- # Calculate query, key, values for all heads
460
- # This is an efficient combined projection that creates Q, K, V in one operation
461
- q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
462
-
463
- # Reshape for multi-head attention
464
- # Each head gets a subset of the embedding dimension
465
- k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
466
- q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
467
- v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
468
-
469
- # Causal self-attention using the bias mask
470
- if self.bias is not None:
471
- # Use the causal mask - this prevents tokens from attending to future tokens
472
- # The mask is a lower triangular matrix where mask[i,j] = 1 if i >= j, 0 otherwise
473
- attn_mask = self.bias[:, :, :T, :T].bool() # Extract mask for current sequence length; cast to bool so scaled_dot_product_attention treats it as a mask rather than adding it to the scores
474
- y = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask,
475
- dropout_p=self.dropout if self.training else 0,
476
- is_causal=False) # We provide our own mask
477
- else:
478
- # Use built-in causal attention (alternative approach)
479
- y = F.scaled_dot_product_attention(q, k, v, attn_mask=None,
480
- dropout_p=self.dropout if self.training else 0,
481
- is_causal=True)
482
-
483
- # Reshape back to original dimensions
484
- y = y.transpose(1, 2).contiguous().view(B, T, C)
485
-
486
- # Output projection with dropout
487
- y = self.resid_dropout(self.c_proj(y))
488
- return y
489
-
490
- class MLP(nn.Module):
491
- """
492
- Multi-Layer Perceptron - Feed-Forward Network in Transformer Blocks
493
-
494
- The MLP is the feed-forward component of each transformer block, consisting of:
495
- - Two linear transformations with a GELU activation in between
496
- - Dropout for regularization
497
- - Optional bias terms (controlled by config.bias)
498
-
499
- ARCHITECTURE:
500
- - c_fc: First linear layer (n_embd -> 4*n_embd) - expansion
501
- - gelu: GELU activation function
502
- - c_proj: Second linear layer (4*n_embd -> n_embd) - projection
503
- - dropout: Dropout layer for regularization
504
-
505
- DESIGN RATIONALE:
506
- - The 4x expansion factor is standard in transformer architectures
507
- - GELU activation provides smooth gradients and good performance
508
- - Dropout prevents overfitting during training
509
- - The combination allows the model to learn complex non-linear transformations
510
-
511
- MATHEMATICAL OPERATION:
512
- - x = dropout(linear2(gelu(linear1(x))))
513
- - This creates a powerful non-linear transformation for each token
514
- """
515
- def __init__(self, config):
516
- super().__init__()
517
- # First linear layer: expand embedding dimension by 4x
518
- self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
519
- # GELU activation function (commonly used in transformers)
520
- self.gelu = nn.GELU()
521
- # Second linear layer: project back to original embedding dimension
522
- self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
523
- # Dropout for regularization
524
- self.dropout = nn.Dropout(config.dropout)
525
-
526
- def forward(self, x):
527
- """
528
- Forward Pass Through the Multi-Layer Perceptron
529
-
530
- This implements the standard feed-forward computation in transformer blocks:
531
- 1. Expand dimension with first linear layer
532
- 2. Apply GELU activation
533
- 3. Project back to original dimension
534
- 4. Apply dropout for regularization
535
-
536
- ARGUMENTS:
537
- - x: Input tensor of shape (batch_size, sequence_length, embedding_dim)
538
-
539
- RETURNS:
540
- - Output tensor of same shape as input
541
- """
542
- x = self.c_fc(x) # Expand: n_embd -> 4*n_embd
543
- x = self.gelu(x) # Apply GELU activation
544
- x = self.c_proj(x) # Project: 4*n_embd -> n_embd
545
- x = self.dropout(x) # Apply dropout for regularization
546
- return x
547
-
548
- class RealOpenLLMInference:
549
- """
550
- Real OpenLLM Inference Engine - Loads and Runs Actual Trained Models
551
-
552
- This is the core inference engine that handles the complete pipeline for loading
553
- and running the actual trained OpenLLM models from Hugging Face Hub. It provides
554
- a unified interface for model management, text generation, and parameter control.
555
-
556
- KEY FEATURES:
557
- - Dynamic model loading from Hugging Face Hub repositories
558
- - Support for all 5 model variants (4k, 6k, 7k, 8k, 9k training steps)
559
- - Comprehensive error handling and logging
560
- - Memory-efficient model management
561
- - Real-time model switching capabilities
562
-
563
- MODEL CONFIGURATIONS:
564
- - Each model has specific training characteristics and performance metrics
565
- - Models are trained on Wikipedia passages from the SQuAD dataset
566
- - Architecture: 6 layers, 8 heads, 512 embedding dim, 35.8M parameters
567
- - Vocabulary: 32k tokens using SentencePiece BPE tokenization
568
-
569
- TECHNICAL IMPLEMENTATION:
570
- - Uses huggingface_hub.snapshot_download for efficient model downloading
571
- - Handles various checkpoint formats (model_state_dict, direct state_dict)
572
- - Supports multiple model file formats (best_model.pt, model.pt, pytorch_model.bin)
573
- - Implements robust config parsing with fallback defaults
574
- - Provides detailed logging for debugging and monitoring
575
-
576
- MEMORY MANAGEMENT:
577
- - Models are loaded on-demand to conserve memory
578
- - Supports multiple models in memory simultaneously
579
- - Automatic cleanup of temporary download directories
580
- - CPU-only inference for compatibility and stability
581
- """
582
-
583
- def __init__(self):
584
- """
585
- Initialize the Real OpenLLM Inference Engine
586
-
587
- Sets up the inference engine with model configurations, storage containers,
588
- and logging infrastructure. This is the entry point for all model operations.
589
-
590
- INITIALIZATION COMPONENTS:
591
- - models: Dictionary to store loaded model instances
592
- - tokenizers: Dictionary to store loaded tokenizer instances
593
- - current_model: Tracks the currently active model
594
- - model_configs: Complete configuration for all available models
595
-
596
- MODEL CONFIGURATIONS INCLUDED:
597
- - 4k model: Early training stage, basic language understanding
598
- - 6k model: Improved coherence, better text generation
599
- - 7k model: Enhanced quality with lower perplexity
600
- - 8k model: Sophisticated understanding and reasoning
601
- - 9k model: Best performing model with highest quality output
602
- """
603
- # Storage containers for loaded models and tokenizers
604
- self.models = {} # Dictionary: model_id -> GPT model instance
605
- self.tokenizers = {} # Dictionary: model_id -> SentencePiece tokenizer
606
- self.current_model = None # Currently active model ID
607
-
608
- # Complete configuration for all available real models from Hugging Face
609
- # Each model has specific training characteristics and performance metrics
610
- self.model_configs = {
611
- "openllm-small-extended-4k": {
612
- "name": "OpenLLM Small (4k steps)",
613
- "description": "Real model trained for 4,000 steps - Early training stage with basic language understanding and simple text generation capabilities. This model represents the initial learning phase where the model begins to understand basic language patterns.",
614
- "hf_repo": "lemms/openllm-small-extended-4k",
615
- "training_steps": 4000,
616
- "parameters": "35.8M"
617
- },
618
- "openllm-small-extended-6k": {
619
- "name": "OpenLLM Small (6k steps)",
620
- "description": "Real model trained for 6,000 steps - Improved coherence and better text generation quality. This model shows significant improvement in understanding context and generating more coherent text sequences. Perplexity: 816.040 indicates substantial learning progress.",
621
- "hf_repo": "lemms/openllm-small-extended-6k",
622
- "training_steps": 6000,
623
- "parameters": "35.8M"
624
- },
625
- "openllm-small-extended-7k": {
626
- "name": "OpenLLM Small (7k steps)",
627
- "description": "Real model trained for 7,000 steps - Enhanced quality with significantly improved text generation. This model demonstrates much better language understanding with Loss: 2.100 and Perplexity: 8.200, showing excellent training convergence.",
628
- "hf_repo": "lemms/openllm-small-extended-7k",
629
- "training_steps": 7000,
630
- "parameters": "35.8M"
631
- },
632
- "openllm-small-extended-8k": {
633
- "name": "OpenLLM Small (8k steps)",
634
- "description": "Real model trained for 8,000 steps - Sophisticated understanding and advanced reasoning capabilities. This model shows deep comprehension of complex language patterns and can generate high-quality, contextually appropriate text.",
635
- "hf_repo": "lemms/openllm-small-extended-8k",
636
- "training_steps": 8000,
637
- "parameters": "35.8M"
638
- },
639
- "openllm-small-extended-9k": {
640
- "name": "OpenLLM Small (9k steps)",
641
- "description": "Real model trained for 9,000 steps - Best performing model with highest quality output. This represents the pinnacle of training for the small model architecture, offering the most sophisticated language understanding and generation capabilities.",
642
- "hf_repo": "lemms/openllm-small-extended-9k",
643
- "training_steps": 9000,
644
- "parameters": "35.8M"
645
- }
646
- }
647
-
648
- # Initialize logging to track engine startup
649
- logger.info("🚀 Real OpenLLM Inference Engine initialized with comprehensive model support")
650
-
651
- def load_model_from_hf(self, model_id: str) -> bool:
652
- """
653
- Load a Real Model from Hugging Face Hub
654
-
655
- This is the main entry point for loading models from Hugging Face Hub.
656
- It handles the complete pipeline from repository identification to model
657
- initialization, including downloading, configuration parsing, and setup.
658
-
659
- LOADING PROCESS:
660
- 1. Validate model_id against available configurations
661
- 2. Download model files from Hugging Face Hub
662
- 3. Parse model configuration and architecture
663
- 4. Initialize GPT model with exact architecture matching
664
- 5. Load trained weights from checkpoint file
665
- 6. Initialize SentencePiece tokenizer
666
- 7. Set model to evaluation mode for inference
667
-
668
- ERROR HANDLING:
669
- - Validates model_id existence before processing
670
- - Handles network errors during download
671
- - Manages file format variations and parsing issues
672
- - Provides detailed error messages for debugging
673
-
674
- ARGUMENTS:
675
- - model_id: String identifier for the model (e.g., "openllm-small-extended-9k")
676
-
677
- RETURNS:
678
- - bool: True if model loaded successfully, False otherwise
679
-
680
- SIDE EFFECTS:
681
- - Downloads model files to temporary directory
682
- - Stores model and tokenizer in internal dictionaries
683
- - Sets current_model to loaded model_id
684
- - Logs detailed progress information
685
- """
686
- try:
687
- # Validate that the requested model exists in our configuration
688
- config = self.model_configs.get(model_id)
689
- if not config:
690
- logger.error(f"❌ Unknown model ID: {model_id} - not found in available configurations")
691
- return False
692
-
693
- logger.info(f"📥 Loading real model from HF: {config['hf_repo']}")
694
-
695
- # Download model files from Hugging Face Hub
696
- # This uses the efficient snapshot_download function that handles caching
697
- # and only downloads files that don't already exist locally
698
- local_dir = snapshot_download(
699
- repo_id=config['hf_repo'],
700
- repo_type="model",
701
- local_dir=f"temp_{model_id}",
702
- allow_patterns=["*.pt", "*.json", "*.model", "*.bin"] # Only download necessary files
703
- )
704
-
705
- logger.info(f"✅ Downloaded model to: {local_dir}")
706
-
707
- # Load model and tokenizer from the downloaded directory
708
- # This is the core loading function that handles all the technical details
709
- success = self._load_model_and_tokenizer(local_dir, model_id)
710
- if success:
711
- # Update current model tracking
712
- self.current_model = model_id
713
- logger.info(f"✅ Successfully loaded real model: {model_id}")
714
- return True
715
- else:
716
- logger.error(f"❌ Failed to load model and tokenizer for: {model_id}")
717
- return False
718
-
719
- except Exception as e:
720
- # Comprehensive error handling for all potential issues
721
- logger.error(f"❌ Failed to load real model from HF {model_id}: {e}")
722
- return False
723
-
724
- def _load_model_and_tokenizer(self, model_dir: str, model_id: str) -> bool:
725
- """
726
- Load Model and Tokenizer from Local Directory - Core Loading Function
727
-
728
- This is the core function that handles the technical details of loading
729
- the model architecture, weights, and tokenizer from the downloaded files.
730
- It implements robust error handling and supports multiple file formats.
731
-
732
- LOADING STEPS:
733
- 1. Parse config.json to extract model architecture parameters
734
- 2. Create GPTConfig object with exact parameter matching
735
- 3. Initialize GPT model with custom architecture
736
- 4. Load state_dict from checkpoint file (handles multiple formats)
737
- 5. Load SentencePiece tokenizer from tokenizer.model
738
- 6. Set model to evaluation mode for inference
739
-
740
- CONFIGURATION HANDLING:
741
- - Supports both direct config and nested model_config structures
742
- - Filters parameters to only include expected GPTConfig fields
743
- - Provides fallback defaults for missing configuration files
744
- - Handles extra configuration fields gracefully
745
-
746
- CHECKPOINT FORMATS SUPPORTED:
747
- - model_state_dict: Standard PyTorch training checkpoint format
748
- - model: Alternative checkpoint key for model weights
749
- - Direct state_dict: Raw model weights without wrapper
750
- - Multiple file formats: best_model.pt, model.pt, pytorch_model.bin
751
-
752
- ERROR HANDLING:
753
- - Validates file existence before processing
754
- - Handles missing configuration files with defaults
755
- - Manages state_dict key mismatches and format variations
756
- - Provides detailed error messages and file listings
757
-
758
- ARGUMENTS:
759
- - model_dir: Path to directory containing model files
760
- - model_id: String identifier for the model being loaded
761
-
762
- RETURNS:
763
- - bool: True if loading successful, False otherwise
764
-
765
- SIDE EFFECTS:
766
- - Stores loaded model in self.models[model_id]
767
- - Stores loaded tokenizer in self.tokenizers[model_id]
768
- - Logs detailed progress and error information
769
- """
770
- try:
771
- model_path = Path(model_dir)
772
-
773
- # STEP 1: Load and parse model configuration
774
- # The config.json file contains all the architectural parameters
775
- config_file = model_path / "config.json"
776
- if config_file.exists():
777
- # Load configuration data from JSON file
778
- with open(config_file, 'r') as f:
779
- config_data = json.load(f)
780
-
781
- logger.info(f"📋 Config data keys: {list(config_data.keys())}")
782
-
783
- # Handle different config structures that might be present
784
- # Some models store config in a nested 'model_config' section
785
- if 'model_config' in config_data:
786
- # Extract model_config section for the actual model parameters
787
- model_config_data = config_data['model_config']
788
- logger.info("🔧 Using nested model_config structure")
789
- else:
790
- # Use the entire config as model config (direct structure)
791
- model_config_data = config_data
792
- logger.info("🔧 Using direct config structure")
793
-
794
- # Create GPTConfig with only the expected parameters
795
- # This filters out any extra fields that might cause issues
796
- expected_params = {
797
- 'vocab_size', 'n_layer', 'n_head', 'n_embd',
798
- 'block_size', 'dropout', 'bias'
799
- }
800
-
801
- config_kwargs = {}
802
- for key, value in model_config_data.items():
803
- if key in expected_params:
804
- config_kwargs[key] = value
805
-
806
- logger.info(f"🔧 Using config parameters: {config_kwargs}")
807
- model_config = GPTConfig(**config_kwargs)
808
- else:
809
- # Fallback to default configuration if config file is missing
810
- # This ensures the system can still work with incomplete model files
811
- logger.warning(f"⚠️ Config file not found, using default configuration")
812
- model_config = GPTConfig(
813
- vocab_size=32000,
814
- n_layer=6,
815
- n_head=8,
816
- n_embd=512,
817
- block_size=1024,
818
- dropout=0.1,
819
- bias=True
820
- )
821
-
822
- # STEP 2: Load model weights from checkpoint file
823
- # Try multiple possible file names and formats
824
- model_file = model_path / "best_model.pt"
825
- if not model_file.exists():
826
- model_file = model_path / "model.pt"
827
- if not model_file.exists():
828
- model_file = model_path / "pytorch_model.bin"
829
-
830
- if model_file.exists():
831
- logger.info(f"📦 Loading model from: {model_file}")
832
-
833
- # Initialize GPT model with the parsed configuration
834
- model = GPT(model_config)
835
-
836
- # Load checkpoint data from file
837
- checkpoint = torch.load(model_file, map_location='cpu')
838
-
839
- # Handle different checkpoint formats that might be present
840
- if isinstance(checkpoint, dict):
841
- if 'model_state_dict' in checkpoint:
842
- # Standard PyTorch training checkpoint format
843
- state_dict = checkpoint['model_state_dict']
844
- logger.info(f"📋 Loading from model_state_dict with {len(state_dict)} keys")
845
- elif 'model' in checkpoint:
846
- # Alternative checkpoint key for model weights
847
- state_dict = checkpoint['model']
848
- logger.info(f"📋 Loading from model with {len(state_dict)} keys")
849
- else:
850
- # Try to load directly as state dict
851
- state_dict = checkpoint
852
- logger.info(f"📋 Loading direct state dict with {len(state_dict)} keys")
853
- else:
854
- # Direct state dict (no wrapper dictionary)
855
- state_dict = checkpoint
856
- logger.info(f"📋 Loading direct state dict with {len(state_dict)} keys")
857
-
858
- # Load the state dict into the model
859
- # This is where the architecture matching is critical
860
- model.load_state_dict(state_dict)
861
-
862
- # Set model to evaluation mode for inference
863
- model.eval()
864
-
865
- # Store the loaded model in our dictionary
866
- self.models[model_id] = model
867
- logger.info(f"✅ Model loaded successfully")
868
- else:
869
- # Handle missing model file
870
- logger.error(f"❌ Model file not found in {model_dir}")
871
- logger.error(f" Available files: {list(model_path.glob('*'))}")
872
- return False
873
-
874
- # STEP 3: Load SentencePiece tokenizer
875
- # The tokenizer is essential for text tokenization and detokenization
876
- tokenizer_file = model_path / "tokenizer.model"
877
- if tokenizer_file.exists():
878
- # Initialize SentencePiece processor
879
- tokenizer = spm.SentencePieceProcessor()
880
-
881
- # Load the trained tokenizer model
882
- tokenizer.load(str(tokenizer_file))
883
-
884
- # Store the loaded tokenizer in our dictionary
885
- self.tokenizers[model_id] = tokenizer
886
- logger.info(f"✅ Tokenizer loaded successfully")
887
- else:
888
- # Handle missing tokenizer file
889
- logger.error(f"❌ Tokenizer file not found in {model_dir}")
890
- return False
891
-
892
- # All components loaded successfully
893
- return True
894
-
895
- except Exception as e:
896
- # Comprehensive error handling with full traceback
897
- logger.error(f"❌ Failed to load model and tokenizer: {e}")
898
- import traceback
899
- logger.error(f"📋 Full traceback: {traceback.format_exc()}")
900
- return False
901
-
902
- def generate_text(self, prompt: str, max_length: int = 100,
903
- temperature: float = 0.7, top_k: int = 50,
904
- top_p: float = 0.9) -> str:
905
- """
906
- Generate Text Using the Loaded Real Model
907
-
908
- This is the main text generation function that uses the loaded model
909
- to generate coherent text based on the input prompt. It implements
910
- the complete generation pipeline from tokenization to text output.
911
-
912
- GENERATION PROCESS:
913
- 1. Validate that a model is currently loaded
914
- 2. Tokenize the input prompt using SentencePiece
915
- 3. Convert tokens to PyTorch tensor format
916
- 4. Generate new tokens using the model's autoregressive generation
917
- 5. Decode the generated tokens back to text
918
- 6. Remove the input prompt from the output for clean results
919
-
920
- GENERATION PARAMETERS:
921
- - temperature: Controls randomness (0.1-2.0, higher = more random)
922
- - top_k: Limits vocabulary to k highest probability tokens (1-100)
923
- - top_p: Nucleus sampling threshold (0.1-1.0, controls diversity)
924
- - max_length: Maximum number of new tokens to generate (10-500)
925
-
926
- SAMPLING STRATEGIES:
927
- - Temperature scaling: Adjusts probability distribution sharpness
928
- - Top-k filtering: Restricts vocabulary to most likely tokens
929
- - Top-p (nucleus) sampling: Dynamic vocabulary selection based on cumulative probability
930
- - Combined sampling: All parameters work together for optimal text quality
931
-
932
- ERROR HANDLING:
933
- - Validates model availability before generation
934
- - Handles tokenization errors gracefully
935
- - Manages generation failures with detailed error messages
936
- - Provides fallback responses for error conditions
937
-
938
- ARGUMENTS:
939
- - prompt: Input text that will be used as the generation seed
940
- - max_length: Maximum number of new tokens to generate
941
- - temperature: Controls randomness in token selection
942
- - top_k: Number of highest probability tokens to consider
943
- - top_p: Nucleus sampling parameter for dynamic vocabulary selection
944
-
945
- RETURNS:
946
- - str: Generated text continuation (prompt removed for clean output)
947
-
948
- SIDE EFFECTS:
949
- - Logs generation parameters and progress
950
- - May trigger model loading if no model is currently active
951
- - Provides detailed error information for debugging
952
- """
953
- # Validate that a model is currently loaded and available
954
- if not self.current_model or self.current_model not in self.models:
955
- return "❌ No model loaded. Please select a model first."
956
-
957
- try:
958
- # Get the currently loaded model and tokenizer
959
- model = self.models[self.current_model]
960
- tokenizer = self.tokenizers[self.current_model]
961
-
962
- # STEP 1: Tokenize the input prompt
963
- # Convert text to token IDs using the SentencePiece tokenizer
964
- input_ids = tokenizer.encode(prompt)
965
-
966
- # Convert to PyTorch tensor format for model processing
967
- input_tensor = torch.tensor([input_ids], dtype=torch.long)
968
-
969
- # Log generation parameters for debugging and monitoring
970
- logger.info(f"🎯 Generating text with prompt: '{prompt[:50]}...'")
971
- logger.info(f"📊 Parameters: max_length={max_length}, temperature={temperature}, top_k={top_k}, top_p={top_p}")
972
-
973
- # STEP 2: Generate text using the model
974
- # Use torch.no_grad() for memory efficiency during inference
975
- with torch.no_grad():
976
- # Call the model's generate method with all parameters
977
- output_ids = model.generate(
978
- input_tensor,
979
- max_new_tokens=max_length,
980
- temperature=temperature,
981
- top_k=top_k,
982
- top_p=top_p,
983
- do_sample=True # Enable stochastic sampling for creative generation
984
- )
985
-
986
- # STEP 3: Decode the generated tokens back to text
987
- # Convert the complete token sequence (input + generated) to text
988
- generated_text = tokenizer.decode(output_ids[0].tolist())
989
-
990
- # STEP 4: Clean up the output by removing the input prompt
991
- # This provides a cleaner user experience by showing only the generated continuation
992
- if generated_text.startswith(prompt):
993
- generated_text = generated_text[len(prompt):].strip()
994
-
995
- # Log successful generation for monitoring
996
- logger.info(f"✅ Generated text: '{generated_text[:100]}...'")
997
- return generated_text
998
-
999
- except Exception as e:
1000
- # Comprehensive error handling with detailed error messages
1001
- error_msg = f"❌ Generation failed: {str(e)}"
1002
- logger.error(error_msg)
1003
- import traceback
1004
- logger.error(f"📋 Full traceback: {traceback.format_exc()}")
1005
- return error_msg
1006
-
1007
- # Initialize the real inference engine
1008
- # This creates the main inference engine instance that will handle all model operations
1009
- inference_engine = RealOpenLLMInference()
1010
-
1011
- def load_model_info(model_id: str) -> str:
1012
- """
1013
- Get Detailed Information About a Specific Model
1014
-
1015
- This function retrieves comprehensive information about a specific model
1016
- from the inference engine's configuration. It provides detailed descriptions
1017
- of the model's training characteristics, performance metrics, and capabilities.
1018
-
1019
- INFORMATION PROVIDED:
1020
- - Model name and training step count
1021
- - Detailed description of model capabilities and characteristics
1022
- - Parameter count and architecture details
1023
- - Training progress indicators and performance metrics
1024
-
1025
- USAGE:
1026
- - Called by the Gradio interface to display model information
1027
- - Updates dynamically when user selects different models
1028
- - Provides educational content about model differences
1029
-
1030
- ARGUMENTS:
1031
- - model_id: String identifier for the model (e.g., "openllm-small-extended-9k")
1032
-
1033
- RETURNS:
1034
- - str: Formatted markdown string with model information
1035
- """
1036
- config = inference_engine.model_configs.get(model_id)
1037
- if config:
1038
- # Format comprehensive model information in markdown
1039
- return f"**{config['name']}**\n\n{config['description']}\n\n**Parameters:** {config['parameters']}\n**Training Steps:** {config['training_steps']:,}"
1040
- return "❌ Model not found"
1041
-
1042
- def generate_text_interface(model_id: str, prompt: str, max_length: int,
1043
- temperature: float, top_k: int, top_p: float) -> str:
1044
- """
1045
- Gradio Interface Function for Text Generation - Main User Interface
1046
-
1047
- This is the primary interface function that connects the Gradio web interface
1048
- to the underlying inference engine. It handles user requests for text generation
1049
- and manages the complete workflow from model loading to text output.
1050
-
1051
- INTERFACE WORKFLOW:
1052
- 1. Receive generation request from Gradio interface
1053
- 2. Check if requested model is already loaded
1054
- 3. Load model if necessary (with progress logging)
1055
- 4. Call the inference engine's text generation function
1056
- 5. Return generated text to the user interface
1057
- 6. Handle any errors and provide user-friendly messages
1058
-
1059
- MODEL LOADING STRATEGY:
1060
- - Models are loaded on-demand to conserve memory
1061
- - Once loaded, models remain in memory for faster subsequent requests
1062
- - Automatic model switching when user selects different models
1063
- - Comprehensive error handling for loading failures
1064
-
1065
- GENERATION PARAMETERS:
1066
- - All parameters are passed through from the Gradio interface
1067
- - Parameters are validated and logged for debugging
1068
- - Default values ensure reasonable generation quality
1069
-
1070
- ERROR HANDLING:
1071
- - Graceful handling of model loading failures
1072
- - User-friendly error messages for interface display
1073
- - Detailed logging for technical debugging
1074
- - Fallback responses for various error conditions
1075
-
1076
- ARGUMENTS:
1077
- - model_id: String identifier for the model to use
1078
- - prompt: Input text prompt for generation
1079
- - max_length: Maximum number of tokens to generate
1080
- - temperature: Controls randomness in generation (0.1-2.0)
1081
- - top_k: Number of highest probability tokens to consider (1-100)
1082
- - top_p: Nucleus sampling parameter (0.1-1.0)
1083
-
1084
- RETURNS:
1085
- - str: Generated text or error message for display
1086
-
1087
- SIDE EFFECTS:
1088
- - May trigger model loading if model not already in memory
1089
- - Logs all generation requests and parameters
1090
- - Updates internal model tracking
1091
- """
1092
- try:
1093
- # Check if the requested model is already loaded in memory
1094
- if model_id not in inference_engine.models:
1095
- logger.info(f"🔄 Loading real model: {model_id}")
1096
- # Load the model from Hugging Face Hub
1097
- success = inference_engine.load_model_from_hf(model_id)
1098
- if not success:
1099
- # Return user-friendly error message if loading fails
1100
- return f"❌ Failed to load real model: {model_id}"
1101
-
1102
- # Generate text using the loaded model with all specified parameters
1103
- result = inference_engine.generate_text(
1104
- prompt=prompt,
1105
- max_length=max_length,
1106
- temperature=temperature,
1107
- top_k=top_k,
1108
- top_p=top_p
1109
- )
1110
-
1111
- # Return the generated text to the Gradio interface
1112
- return result
1113
-
1114
- except Exception as e:
1115
- # Comprehensive error handling for any unexpected issues
1116
- error_msg = f"❌ Error in generation interface: {str(e)}"
1117
- logger.error(error_msg)
1118
- return error_msg
1119
-
1120
- # Create Gradio interface
1121
- def create_interface():
1122
- """
1123
- Create the Complete Gradio Web Interface
1124
-
1125
- This function builds the entire Gradio web interface that provides users
1126
- with an intuitive way to interact with the OpenLLM models. The interface
1127
- includes model selection, parameter controls, and text generation capabilities.
1128
-
1129
- INTERFACE COMPONENTS:
1130
- - Header section with project information and model descriptions
1131
- - Model selection dropdown with detailed information display
1132
- - Text input area for user prompts
1133
- - Generation parameter controls (temperature, top-k, top-p, max length)
1134
- - Generate button for triggering text generation
1135
- - Output area for displaying generated text
1136
- - Footer with technical details and model sources
1137
-
1138
- LAYOUT DESIGN:
1139
- - Two-column layout for efficient space utilization
1140
- - Left column: Model selection and information
1141
- - Right column: Input controls and generation parameters
1142
- - Responsive design that works on different screen sizes
1143
- - Professional styling with Soft theme for modern appearance
1144
-
1145
- USER EXPERIENCE FEATURES:
1146
- - Real-time model information updates
1147
- - Intuitive parameter controls with helpful descriptions
1148
- - Clear visual feedback for all user actions
1149
- - Comprehensive error handling and user guidance
1150
- - Educational content about model differences and capabilities
1151
-
1152
- TECHNICAL INTEGRATION:
1153
- - Seamless connection to the inference engine
1154
- - Automatic model loading and switching
1155
- - Real-time parameter validation and feedback
1156
- - Comprehensive logging and error reporting
1157
- - Memory-efficient model management
1158
-
1159
- RETURNS:
1160
- - gr.Blocks: Complete Gradio interface ready for deployment
1161
- """
1162
-
1163
- # Create the main Gradio interface with professional styling
1164
- with gr.Blocks(
1165
- title="🚀 OpenLLM Real Models Space",
1166
- theme=gr.themes.Soft() # Modern, professional theme
1167
- ) as interface:
1168
-
1169
- # Header section with comprehensive project information
1170
- gr.Markdown("""
1171
- # 🚀 OpenLLM Real Models Space
1172
-
1173
- Welcome to the OpenLLM Real Models Space! This interface uses **actual trained models** from Hugging Face.
1174
-
1175
- ## 🎯 Real Trained Models
1176
-
1177
- We provide **5 different real models** with varying training steps:
1178
-
1179
- | Model | Training Steps | Parameters | Performance |
1180
- |-------|---------------|------------|-------------|
1181
- | **4k Model** | 4,000 | 35.8M | Early training stage |
1182
- | **6k Model** | 6,000 | 35.8M | Improved coherence (Perplexity: 816.040) |
1183
- | **7k Model** | 7,000 | 35.8M | Enhanced quality (Loss: 2.100, Perplexity: 8.200) |
1184
- | **8k Model** | 8,000 | 35.8M | Sophisticated understanding |
1185
- | **9k Model** | 9,000 | 35.8M | Best performing model |
1186
-
1187
- **These are real GPT-style transformer models trained on Wikipedia passages from the SQuAD dataset.**
1188
-
1189
- ---
1190
- """)
1191
-
-         # Main interface layout with two columns
-         with gr.Row():
-             # Left column: Model selection and information
-             with gr.Column(scale=1):
-                 # Model selection dropdown
-                 # This allows users to choose between different model variants
-                 model_dropdown = gr.Dropdown(
-                     choices=list(inference_engine.model_configs.keys()),  # All available models
-                     value="openllm-small-extended-9k",  # Default to best performing model
-                     label="🎯 Select Model",
-                     info="Choose the real trained model to use"
-                 )
-
-                 # Model information display
-                 # Shows detailed information about the selected model
-                 model_info = gr.Markdown(
-                     value=load_model_info("openllm-small-extended-9k"),  # Default model info
-                     label="📋 Model Information"
-                 )
-
-                 # Update model info when selection changes
-                 # This provides real-time updates as users switch between models
-                 model_dropdown.change(
-                     fn=load_model_info,
-                     inputs=[model_dropdown],
-                     outputs=[model_info]
-                 )
-
-             # Right column: Input controls and generation parameters
-             with gr.Column(scale=2):
-                 # Text input area for user prompts
-                 # This is where users enter their text for generation
-                 prompt_input = gr.Textbox(
-                     lines=5,  # Multi-line input for longer prompts
-                     label="📝 Input Prompt",
-                     placeholder="Enter your text prompt here...",
-                     info="The text that will be used as input for generation"
-                 )
-
-                 # Generation parameters in organized rows
-                 # First row: Max length and temperature controls
-                 with gr.Row():
-                     # Maximum length control
-                     max_length = gr.Slider(
-                         minimum=10,
-                         maximum=500,
-                         value=100,  # Default to reasonable length
-                         step=10,
-                         label="📏 Max Length",
-                         info="Maximum number of tokens to generate"
-                     )
-
-                     # Temperature control for randomness
-                     temperature = gr.Slider(
-                         minimum=0.1,
-                         maximum=2.0,
-                         value=0.7,  # Default to balanced creativity
-                         step=0.1,
-                         label="🌡️ Temperature",
-                         info="Controls randomness (higher = more random)"
-                     )
-
-                 # Second row: Top-k and top-p controls
-                 with gr.Row():
-                     # Top-k filtering control
-                     top_k = gr.Slider(
-                         minimum=1,
-                         maximum=100,
-                         value=50,  # Default to reasonable filtering
-                         step=1,
-                         label="🔝 Top-K",
-                         info="Number of highest probability tokens to consider"
-                     )
-
-                     # Top-p (nucleus) sampling control
-                     top_p = gr.Slider(
-                         minimum=0.1,
-                         maximum=1.0,
-                         value=0.9,  # Default to high diversity
-                         step=0.1,
-                         label="📊 Top-P",
-                         info="Nucleus sampling parameter"
-                     )
-
-                 # Generate button
-                 # This triggers the text generation process
-                 generate_btn = gr.Button(
-                     "🚀 Generate Text",
-                     variant="primary",  # Prominent styling
-                     size="lg"  # Large button for easy interaction
-                 )
-
-                 # Output area for displaying generated text
-                 # This shows the results of the generation process
-                 output_text = gr.Textbox(
-                     lines=10,  # Large output area for generated text
-                     label="🎯 Generated Text",
-                     info="The generated text will appear here"
-                 )
-
-         # Connect the generate button to the generation function
-         # This creates the workflow from user input to text output
-         generate_btn.click(
-             fn=generate_text_interface,
-             inputs=[model_dropdown, prompt_input, max_length, temperature, top_k, top_p],
-             outputs=[output_text]
-         )
-
-         # Footer section with technical details and model sources
-         gr.Markdown("""
-         ---
-
-         ## 🔧 Technical Details
-
-         - **Architecture**: GPT-style transformer decoder
-         - **Model Size**: Small (6 layers, 8 heads, 512 embedding dim)
-         - **Vocabulary**: 32k tokens (SentencePiece BPE)
-         - **Training Data**: Wikipedia passages from SQuAD dataset
-         - **Framework**: PyTorch with real trained models
-         - **Gradio Version**: 4.44.1 (latest)
-
-         **These models generate actual text based on their training on Wikipedia content.**
-
-         **Model Sources:**
-         - [4k Model](https://huggingface.co/lemms/openllm-small-extended-4k)
-         - [6k Model](https://huggingface.co/lemms/openllm-small-extended-6k)
-         - [7k Model](https://huggingface.co/lemms/openllm-small-extended-7k)
-         - [8k Model](https://huggingface.co/lemms/openllm-small-extended-8k)
-         - [9k Model](https://huggingface.co/lemms/openllm-small-extended-9k)
-         """)
-
-     return interface
-
- # Create and launch the interface
- if __name__ == "__main__":
-     """
-     Main Application Entry Point
-
-     This is the entry point for the Gradio application. It creates the interface
-     and launches the web server for user interaction.
-
-     LAUNCH CONFIGURATION:
-     - server_name: "0.0.0.0" allows external connections
-     - server_port: 7860 is the standard Gradio port
-     - share: False for local deployment (set to True for public sharing)
-     - debug: True for development logging and error details
-
-     DEPLOYMENT CONSIDERATIONS:
-     - The application is designed for Hugging Face Spaces deployment
-     - All dependencies are specified in requirements.txt
-     - The interface is optimized for web-based interaction
-     - Error handling is comprehensive for production use
-
-     TECHNICAL FEATURES:
-     - Automatic model loading and management
-     - Real-time text generation capabilities
-     - Comprehensive parameter controls
-     - Professional user interface design
-     - Robust error handling and logging
-     """
-     # Create the complete Gradio interface
-     interface = create_interface()
-
-     # Launch the web server with production-ready configuration
-     interface.launch(
-         server_name="0.0.0.0",  # Allow external connections
-         server_port=7860,  # Standard Gradio port
-         share=False,  # Local deployment (set to True for public sharing)
-         debug=True  # Enable debug logging for development
-     )
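The removed entry point above notes that `share=False` keeps the launch local and that setting it to `True` enables public sharing. As a minimal illustrative sketch (not part of either version of app.py), a publicly shared launch would differ only in that one flag:

# Illustrative only: same configuration as the launch call above,
# but with Gradio's temporary public share link enabled.
interface = create_interface()
interface.launch(
    server_name="0.0.0.0",
    server_port=7860,
    share=True,   # Gradio generates a temporary public *.gradio.live URL
    debug=False,  # debug output is usually disabled outside development
)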
 
+ #!/usr/bin/env python3
+ """
+ OpenLLM Inference Space - Gradio Interface
+ Loads models from Hugging Face repositories to avoid storage limits
+ """
+
+ import gradio as gr
+ import torch
+ import json
+ import os
+ from pathlib import Path
+ from typing import Dict, Any, Optional
+ import logging
+
+ # Set up logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # Import our custom modules
+ try:
+     from core.src.model import GPTConfig, GPT
+     from core.src.inference_server import OpenLLMInference
+     logger.info("✅ Successfully imported core modules")
+ except ImportError as e:
+     logger.error(f"❌ Failed to import core modules: {e}")
+     raise
+
+ class OpenLLMInferenceEngine:
+     """
+     Inference engine that loads models from Hugging Face repositories
+     """
+
+     def __init__(self):
+         self.models = {}
+         self.tokenizers = {}
+         self.current_model = None
+         self.current_tokenizer = None
+
+         # Model configurations with Hugging Face repository IDs
+         self.model_configs = {
+             "openllm-small-extended-4k": {
+                 "name": "OpenLLM Small (4k steps)",
+                 "description": "Small model trained for 4,000 steps - Early training stage",
+                 "hf_repo": "lemms/openllm-small-extended-4k",
+                 "local_path": "models/small-extended-4k",  # Fallback if HF repo fails
+                 "checkpoint": "best_model.pt",
+                 "config": "config.json"
+             },
+             "openllm-small-extended-6k": {
+                 "name": "OpenLLM Small (6k steps)",
+                 "description": "Small model trained for 6,000 steps - Improved coherence",
+                 "hf_repo": "lemms/openllm-small-extended-6k",
+                 "local_path": "models/small-extended-6k",
+                 "checkpoint": "best_model.pt",
+                 "config": "config.json"
+             },
+             "openllm-small-extended-7k": {
+                 "name": "OpenLLM Small (7k steps)",
+                 "description": "Small model trained for 7,000 steps - Enhanced quality",
+                 "hf_repo": "lemms/openllm-small-extended-7k",
+                 "local_path": "models/small-extended-7k",
+                 "checkpoint": "best_model.pt",
+                 "config": "config.json"
+             },
+             "openllm-small-extended-8k": {
+                 "name": "OpenLLM Small (8k steps)",
+                 "description": "Small model trained for 8,000 steps - Sophisticated understanding",
+                 "hf_repo": "lemms/openllm-small-extended-8k",
+                 "local_path": "models/small-extended-8k",
+                 "checkpoint": "best_model.pt",
+                 "config": "config.json"
+             },
+             "openllm-small-extended-9k": {
+                 "name": "OpenLLM Small (9k steps)",
+                 "description": "Small model trained for 9,000 steps - Best performing model",
+                 "hf_repo": "lemms/openllm-small-extended-9k",
+                 "local_path": "models/small-extended-9k",
+                 "checkpoint": "best_model.pt",
+                 "config": "config.json"
+             },
+             "openllm-small-extended-10k": {
+                 "name": "OpenLLM Small (10k steps)",
+                 "description": "Small model trained for 10,000 steps - Latest extended training",
+                 "hf_repo": "lemms/openllm-small-extended-10k",
+                 "local_path": "models/small-extended-10k",
+                 "checkpoint": "best_model.pt",
+                 "config": "config.json"
+             }
+         }
+
+         logger.info("🚀 OpenLLM Inference Engine initialized")
+         logger.info(f"📋 Available models: {list(self.model_configs.keys())}")
+
+     def load_model_from_hf(self, model_id: str) -> bool:
+         """
+         Load model from Hugging Face repository
+         """
+         try:
+             from huggingface_hub import snapshot_download
+
+             config = self.model_configs.get(model_id)
+             if not config:
+                 logger.error(f"❌ Unknown model ID: {model_id}")
+                 return False
+
+             logger.info(f"📥 Loading model from HF: {config['hf_repo']}")
+
+             # Download model files from Hugging Face
+             local_dir = snapshot_download(
+                 repo_id=config['hf_repo'],
+                 repo_type="model",
+                 local_dir=f"temp_{model_id}",
+                 allow_patterns=["*.pt", "*.json", "*.model"]
+             )
+
+             logger.info(f"✅ Downloaded model to: {local_dir}")
+
+             # Initialize inference engine with downloaded model
+             inference_engine = OpenLLMInference(
+                 model_path=Path(local_dir),
+                 model_format="pytorch"
+             )
+
+             self.models[model_id] = inference_engine
+             self.current_model = model_id
+
+             logger.info(f"✅ Successfully loaded model: {model_id}")
+             return True
+
+         except Exception as e:
+             logger.error(f"❌ Failed to load model from HF {model_id}: {e}")
+
+             # Fall back to the local model if available
+             logger.info("🔄 Trying fallback to local model...")
+             return self.load_model_local(model_id)
+
+     def load_model_local(self, model_id: str) -> bool:
+         """
+         Load model from local files (fallback method)
+         """
+         try:
+             config = self.model_configs.get(model_id)
+             if not config:
+                 return False
+
+             local_path = Path(config['local_path'])
+             if not local_path.exists():
+                 logger.error(f"❌ Local model path not found: {local_path}")
+                 return False
+
+             logger.info(f"📂 Loading local model: {local_path}")
+
+             # Initialize inference engine with local model
+             inference_engine = OpenLLMInference(
+                 model_path=local_path,
+                 model_format="pytorch"
+             )
+
+             self.models[model_id] = inference_engine
+             self.current_model = model_id
+
+             logger.info(f"✅ Successfully loaded local model: {model_id}")
+             return True
+
+         except Exception as e:
+             logger.error(f"❌ Failed to load local model {model_id}: {e}")
+             return False
+
+     def generate_text(self, prompt: str, max_length: int = 100,
+                       temperature: float = 0.7, top_k: int = 50,
+                       top_p: float = 0.9) -> str:
+         """
+         Generate text using the currently loaded model
+         """
+         if not self.current_model or self.current_model not in self.models:
+             return "❌ No model loaded. Please select a model first."
+
+         try:
+             logger.info(f"🎯 Generating text with model: {self.current_model}")
+             logger.info(f"📝 Prompt: {prompt[:100]}...")
+             logger.info(f"⚙️ Parameters: max_length={max_length}, temp={temperature}, top_k={top_k}, top_p={top_p}")
+
+             # Get the inference engine
+             inference_engine = self.models[self.current_model]
+
+             # Prepare generation parameters
+             generation_params = {
+                 "max_length": max_length,
+                 "temperature": temperature,
+                 "top_k": top_k,
+                 "top_p": top_p,
+                 "do_sample": True,
+                 "pad_token_id": 0,
+                 "eos_token_id": 1
+             }
+
+             # Generate text
+             generated_texts = inference_engine.generate(
+                 prompt=prompt,
+                 **generation_params
+             )
+
+             # Extract the generated text
+             if isinstance(generated_texts, list) and len(generated_texts) > 0:
+                 generated_text = generated_texts[0]
+             else:
+                 generated_text = str(generated_texts)
+
+             logger.info(f"✅ Generated text length: {len(generated_text)}")
+             return generated_text
+
+         except Exception as e:
+             error_msg = f"❌ Generation failed: {str(e)}"
+             logger.error(error_msg)
+             return error_msg
+
+ # Initialize the inference engine
+ inference_engine = OpenLLMInferenceEngine()
+
+ def load_model_info(model_id: str) -> str:
+     """Get information about a specific model"""
+     config = inference_engine.model_configs.get(model_id)
+     if config:
+         return f"**{config['name']}**\n\n{config['description']}"
+     return "❌ Model not found"
+
+ def generate_text_interface(model_id: str, prompt: str, max_length: int,
+                             temperature: float, top_k: int, top_p: float) -> str:
+     """
+     Gradio interface function for text generation
+     """
+     try:
+         # Load model if not already loaded
+         if model_id not in inference_engine.models:
+             logger.info(f"🔄 Loading model: {model_id}")
+             success = inference_engine.load_model_from_hf(model_id)
+             if not success:
+                 return f"❌ Failed to load model: {model_id}"
+
+         # Generate text
+         result = inference_engine.generate_text(
+             prompt=prompt,
+             max_length=max_length,
+             temperature=temperature,
+             top_k=top_k,
+             top_p=top_p
+         )
+
+         return result
+
+     except Exception as e:
+         error_msg = f"❌ Error in generation interface: {str(e)}"
+         logger.error(error_msg)
+         return error_msg
+
+ # Create Gradio interface
+ def create_interface():
+     """Create the Gradio interface"""
+
+     with gr.Blocks(
+         title="🚀 OpenLLM Inference Space",
+         theme=gr.themes.Soft(),
+         css="""
+         .gradio-container {
+             max-width: 1200px !important;
+             margin: auto !important;
+         }
+         .model-info {
+             background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+             color: white;
+             padding: 20px;
+             border-radius: 10px;
+             margin-bottom: 20px;
+         }
+         """
+     ) as interface:
+
+         # Header
+         gr.Markdown("""
+         # 🚀 OpenLLM Inference Space
+
+         Welcome to the OpenLLM Inference Space, an interface for running inference on the trained OpenLLM models with customizable parameters.
+
+         ## 🎯 Available Models
+
+         We provide **6 different models** trained for varying numbers of steps:
+
+         | Model | Training Steps | Description | Best Loss |
+         |-------|---------------|-------------|-----------|
+         | **4k Model** | 4,000 | Early training stage, basic language patterns | ~6.2 |
+         | **6k Model** | 6,000 | Improved coherence, better vocabulary usage | ~5.8 |
+         | **7k Model** | 7,000 | Enhanced text generation quality | ~5.5 |
+         | **8k Model** | 8,000 | More sophisticated language understanding | ~5.3 |
+         | **9k Model** | 9,000 | Best performing model (latest training) | ~5.2 |
+         | **10k Model** | 10,000 | Latest extended training run | n/a |
+
+         ---
+         """)
+
+         with gr.Row():
+             with gr.Column(scale=1):
+                 # Model selection
+                 model_dropdown = gr.Dropdown(
+                     choices=list(inference_engine.model_configs.keys()),
+                     value="openllm-small-extended-9k",
+                     label="🎯 Select Model",
+                     info="Choose the model to use for inference"
+                 )
+
+                 # Model information display
+                 model_info = gr.Markdown(
+                     value=load_model_info("openllm-small-extended-9k"),
+                     label="📋 Model Information"
+                 )
+
+                 # Update model info when selection changes
+                 model_dropdown.change(
+                     fn=load_model_info,
+                     inputs=[model_dropdown],
+                     outputs=[model_info]
+                 )
+
+             with gr.Column(scale=2):
+                 # Input prompt
+                 prompt_input = gr.Textbox(
+                     lines=5,
+                     label="📝 Input Prompt",
+                     placeholder="Enter your text prompt here...",
+                     info="The text that will be used as input for generation"
+                 )
+
+                 # Generation parameters
+                 with gr.Row():
+                     max_length = gr.Slider(
+                         minimum=10,
+                         maximum=500,
+                         value=100,
+                         step=10,
+                         label="📏 Max Length",
+                         info="Maximum number of tokens to generate"
+                     )
+
+                     temperature = gr.Slider(
+                         minimum=0.1,
+                         maximum=2.0,
+                         value=0.7,
+                         step=0.1,
+                         label="🌡️ Temperature",
+                         info="Controls randomness (higher = more random)"
+                     )
+
+                 with gr.Row():
+                     top_k = gr.Slider(
+                         minimum=1,
+                         maximum=100,
+                         value=50,
+                         step=1,
+                         label="🔝 Top-K",
+                         info="Number of highest probability tokens to consider"
+                     )
+
+                     top_p = gr.Slider(
+                         minimum=0.1,
+                         maximum=1.0,
+                         value=0.9,
+                         step=0.1,
+                         label="📊 Top-P",
+                         info="Nucleus sampling parameter"
+                     )
+
+                 # Generate button
+                 generate_btn = gr.Button(
+                     "🚀 Generate Text",
+                     variant="primary",
+                     size="lg"
+                 )
+
+                 # Output
+                 output_text = gr.Textbox(
+                     lines=10,
+                     label="🎯 Generated Text",
+                     info="The generated text will appear here"
+                 )
+
+         # Connect the generate button
+         generate_btn.click(
+             fn=generate_text_interface,
+             inputs=[model_dropdown, prompt_input, max_length, temperature, top_k, top_p],
+             outputs=[output_text]
+         )
+
+         # Footer
+         gr.Markdown("""
+         ---
+
+         ## 🔧 Technical Details
+
+         - **Architecture**: GPT-style transformer decoder
+         - **Model Size**: Small (6 layers, 8 heads, 512 embedding dim)
+         - **Vocabulary**: 32k tokens (SentencePiece BPE)
+         - **Training Data**: Custom dataset with 27.6M tokens
+         - **Framework**: PyTorch with custom training loop
+
+         ## 📚 Usage Tips
+
+         1. **Start with the 9k model** for best results
+         2. **Lower temperature** (0.3-0.7) for more focused output
+         3. **Higher temperature** (0.8-1.2) for more creative output
+         4. **Adjust max_length** based on your needs
+         5. **Experiment with top-k and top-p** for different generation styles
+
+         ---
+
+         **Built with ❤️ using OpenLLM Framework**
+         """)
+
+     return interface
+
+ # Create and launch the interface
+ if __name__ == "__main__":
+     interface = create_interface()
+     interface.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False,
+         debug=True
+     )
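The added file exposes a small programmatic API (`load_model_from_hf`, `load_model_local`, `generate_text`) that the Gradio callbacks wrap. Below is a minimal sketch of driving that API directly, assuming the Space's `core.src` package is importable and the `lemms/*` repositories are reachable; the import path, prompt, and parameter choices are illustrative, not part of the committed file:

# Sketch: exercise OpenLLMInferenceEngine outside the Gradio UI.
from app import OpenLLMInferenceEngine  # assumes the file above is importable as app.py

engine = OpenLLMInferenceEngine()

# Downloads lemms/openllm-small-extended-9k via snapshot_download,
# falling back to models/small-extended-9k if the Hub is unreachable.
if engine.load_model_from_hf("openllm-small-extended-9k"):
    text = engine.generate_text(
        prompt="The history of artificial intelligence",
        max_length=100,   # default slider value in the interface
        temperature=0.7,  # "focused" end of the range suggested in the Usage Tips
        top_k=50,
        top_p=0.9,
    )
    print(text)
else:
    print("Model could not be loaded from the Hub or the local fallback path.")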