Sualeh Qureshi committed on
Commit
13f6128
·
1 Parent(s): dc69345

Added README.md and app.py for Hugging Face Space

Files changed (5)
  1. .gitignore +3 -0
  2. README.md +321 -0
  3. README_SPACE.md +108 -0
  4. app.py +259 -0
  5. requirements.txt +5 -0
.gitignore CHANGED
@@ -12,3 +12,6 @@ wheels/
  # Checkpoints
  checkpoints/
 
+ # tensorboard logs
+ logs/tensorboard/
+
README.md CHANGED
@@ -0,0 +1,321 @@
+ # SmolLM2-135M Implementation
+
+ A from-scratch PyTorch implementation of the SmolLM2-135M language model, following the LLaMA architecture with modern optimizations.
+
+ ## Overview
+
+ This repository contains a complete implementation of SmolLM2-135M, a 135 million parameter decoder-only transformer model. The implementation includes:
+
+ - **Model Architecture** (`model.py`): Complete model definition with KV cache support
+ - **Training Script** (`train.py`): PyTorch Lightning training with WSD scheduler
+ - **Gradio App** (`app.py`): Interactive web interface for text generation
+
+ ## Model Architecture (`model.py`)
+
+ ### Architecture Components
+
+ The model follows the LLaMA-style decoder-only transformer architecture with the following key components:
+
+ #### 1. **SmolConfig** (Configuration Class)
+
+ A dataclass that stores all model hyperparameters:
+
+ ```python
+ @dataclass
+ class SmolConfig:
+     vocab_size: int = 49152              # Vocabulary size
+     hidden_size: int = 576               # Hidden dimension
+     intermediate_size: int = 1536        # MLP intermediate dimension
+     num_hidden_layers: int = 30          # Number of transformer layers
+     num_attention_heads: int = 9         # Number of query heads
+     num_key_value_heads: int = 3         # Number of key/value heads (GQA)
+     max_position_embeddings: int = 8192  # Maximum sequence length
+     rope_theta: float = 100000.0         # RoPE base frequency
+     rms_norm_eps: float = 1e-5           # RMSNorm epsilon
+     attention_bias: bool = False         # Whether to use bias in attention
+     mlp_bias: bool = False               # Whether to use bias in MLP
+     dtype: torch.dtype = torch.bfloat16
+ ```
+
+ **Key Features:**
+ - `head_dim` property: Automatically computes the per-head dimension (`hidden_size // num_attention_heads` = 64)
+ - `from_hf()` class method: Loads the configuration from a HuggingFace model config (see the sketch below)
+
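+ For reference, a minimal sketch of what `from_hf()` does, assuming the HuggingFace config exposes the standard LLaMA-style field names (the exact mapping lives in `model.py`):
+
+ ```python
+ @classmethod
+ def from_hf(cls, hf_config) -> "SmolConfig":
+     # Map HuggingFace config fields onto SmolConfig; anything the HF config
+     # does not define falls back to the dataclass defaults.
+     return cls(
+         vocab_size=hf_config.vocab_size,
+         hidden_size=hf_config.hidden_size,
+         intermediate_size=hf_config.intermediate_size,
+         num_hidden_layers=hf_config.num_hidden_layers,
+         num_attention_heads=hf_config.num_attention_heads,
+         num_key_value_heads=hf_config.num_key_value_heads,
+         max_position_embeddings=hf_config.max_position_embeddings,
+         rope_theta=hf_config.rope_theta,
+         rms_norm_eps=hf_config.rms_norm_eps,
+     )
+ ```
+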
+ #### 2. **RMSNorm** (Root Mean Square Normalization)
+
+ Replaces LayerNorm with a more efficient normalization:
+
+ ```python
+ class RMSNorm(nn.Module):
+     def forward(self, x):
+         norm = x.pow(2).mean(dim=-1, keepdim=True)
+         x = x * torch.rsqrt(norm + self.eps)
+         return self.weight * x
+ ```
+
+ **Benefits:**
+ - More efficient than LayerNorm (no mean subtraction)
+ - Used throughout the model for pre-norm architecture
+
+ #### 3. **RoPE** (Rotary Positional Embeddings)
+
+ Rotary Position Embeddings applied to query and key tensors:
+
+ ```python
+ def build_rope_cache(seq_len, head_dim, base, device, dtype):
+     # Computes cosine and sine caches for RoPE
+     half_dim = head_dim // 2
+     freq_seq = torch.arange(half_dim, device=device, dtype=dtype)
+     inv_freq = 1.0 / (base ** (freq_seq / half_dim))
+     t = torch.arange(seq_len, device=device, dtype=dtype)
+     freqs = torch.outer(t, inv_freq)
+     cos = freqs.cos()[None, None, :, :]
+     sin = freqs.sin()[None, None, :, :]
+     return cos, sin
+
+ def apply_rope(x, cos, sin):
+     # Applies rotary transformation to input tensor
+     half = x.shape[-1] // 2
+     x1, x2 = x[..., :half], x[..., half:]
+     x1_rot = x1 * cos - x2 * sin
+     x2_rot = x1 * sin + x2 * cos
+     return torch.cat([x1_rot, x2_rot], dim=-1)
+ ```
+
+ **Key Features:**
+ - Relative positional encoding (no absolute position embeddings)
+ - Applied only to Q and K (not V)
+ - Supports efficient caching for inference
+
+ #### 4. **MultiHeadSelfAttention** (Grouped Query Attention)
+
+ Implements GQA (Grouped Query Attention) where:
+ - **Query heads**: 9 (full attention)
+ - **Key/Value heads**: 3 (shared across query heads)
+
+ ```python
+ class MultiHeadSelfAttention(nn.Module):
+     def forward(self, x, cos, sin, past_key_value=None, use_cache=False):
+         # 1. Project to Q, K, V (reshape to (B, heads, T, head_dim) elided for brevity)
+         q = self.q_proj(x)  # (B, T, n_heads * head_dim)
+         k = self.k_proj(x)  # (B, T, n_kv_heads * head_dim)
+         v = self.v_proj(x)  # (B, T, n_kv_heads * head_dim)
+
+         # 2. Apply RoPE to Q and K
+         q = apply_rope(q, cos, sin)
+         k = apply_rope(k, cos, sin)
+
+         # 3. KV Cache support (for inference)
+         if past_key_value is not None:
+             past_k, past_v = past_key_value
+             k = torch.cat([past_k, k], dim=2)
+             v = torch.cat([past_v, v], dim=2)
+         present_key_value = (k, v) if use_cache else None
+
+         # 4. GQA: Expand K/V if needed
+         if n_kv_heads < n_heads:
+             repeat_factor = n_heads // n_kv_heads
+             k = k.repeat_interleave(repeat_factor, dim=1)
+             v = v.repeat_interleave(repeat_factor, dim=1)
+
+         # 5. Compute attention scores
+         scores = (q @ k.transpose(-2, -1)) / sqrt(head_dim)
+         scores = scores + causal_mask  # Causal masking
+
+         # 6. Softmax and weighted sum
+         probs = F.softmax(scores, dim=-1)
+         out = probs @ v
+
+         return out, present_key_value
+ ```
+
+ **Key Features:**
+ - **KV Cache**: Efficient inference by caching past key-value pairs
+ - **GQA**: Reduces memory by sharing K/V heads (3:1 ratio)
+ - **Causal Masking**: Prevents attending to future tokens (see the mask sketch below)
+ - **RoPE Integration**: Positional encoding via rotary embeddings
+
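+ The `causal_mask` used in step 5 is built outside this excerpt. A minimal sketch of one way to construct it when a KV cache is present (an illustration, not the exact code in `model.py`): each of the `q_len` new query positions may attend to all cached positions and to itself, but not to later positions.
+
+ ```python
+ import torch
+
+ def build_causal_mask(q_len, kv_len, device, dtype=torch.float32):
+     # Query position i (offset by the cached prefix) may attend to key j
+     # only if j <= past_len + i; disallowed positions get -inf before softmax.
+     past_len = kv_len - q_len
+     i = torch.arange(q_len, device=device)[:, None]
+     j = torch.arange(kv_len, device=device)[None, :]
+     mask = torch.zeros(q_len, kv_len, device=device, dtype=dtype)
+     return mask.masked_fill(j > past_len + i, float("-inf"))  # broadcasts over (B, heads, q_len, kv_len)
+ ```
+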
+ #### 5. **SmolMLP** (SwiGLU Activation)
+
+ Implements the SwiGLU (Swish-Gated Linear Unit) MLP:
+
+ ```python
+ class SmolMLP(nn.Module):
+     def forward(self, x):
+         # fc1 outputs 2 * intermediate_size
+         x = self.fc1(x)              # (B, T, 2 * 1536) = (B, T, 3072)
+         x1, x2 = x.chunk(2, dim=-1)  # Split into two parts
+         # SwiGLU: SiLU(x1) * x2
+         return self.fc2(F.silu(x1) * x2)
+ ```
+
+ **Key Features:**
+ - **SwiGLU**: `SiLU(x1) * x2` gated activation (empirically stronger than ReLU/GELU in transformer MLPs)
+ - **No bias**: Following LLaMA architecture
+ - **Efficient**: Single matrix multiplication with split
+
+ #### 6. **SmolBlock** (Transformer Block)
+
+ Combines attention and MLP with pre-norm and residual connections:
+
+ ```python
+ class SmolBlock(nn.Module):
+     def forward(self, x, cos, sin, past_key_value=None, use_cache=False):
+         # Pre-norm attention with residual
+         attn_out, present_kv = self.attn(
+             self.attn_norm(x), cos, sin,
+             past_key_value=past_key_value, use_cache=use_cache
+         )
+         x = x + attn_out
+
+         # Pre-norm MLP with residual
+         x = x + self.mlp(self.mlp_norm(x))
+
+         return x, present_kv
+ ```
+
+ **Architecture:**
+ - **Pre-norm**: Normalization before attention/MLP (not after)
+ - **Residual connections**: Skip connections for gradient flow
+ - **KV Cache passthrough**: Supports efficient inference
+
+ #### 7. **SmolLM2** (Main Model)
+
+ Top-level model that combines all components:
+
+ ```python
+ class SmolLM2(nn.Module):
+     def __init__(self, config):
+         self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
+         self.layers = nn.ModuleList([SmolBlock(config) for _ in range(30)])
+         self.norm = RMSNorm(hidden_size)
+         self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
+
+         # Weight tying: share embeddings and output weights
+         self.lm_head.weight = self.embed_tokens.weight
+
+     def forward(self, input_ids, past_key_values=None, use_cache=False):
+         # 1. Token embeddings
+         x = self.embed_tokens(input_ids)
+
+         # 2. Build RoPE cache
+         cos, sin = build_rope_cache(...)
+
+         # 3. Pass through transformer layers (each layer gets its own cached K/V)
+         present_key_values = []
+         for i, layer in enumerate(self.layers):
+             past_kv = past_key_values[i] if past_key_values is not None else None
+             x, present_kv = layer(x, cos, sin, past_key_value=past_kv, use_cache=use_cache)
+             if use_cache:
+                 present_key_values.append(present_kv)
+
+         # 4. Final norm and language modeling head
+         x = self.norm(x)
+         logits = self.lm_head(x)
+
+         return logits, present_key_values
+ ```
+
+ **Key Features:**
+ - **Weight Tying**: Embeddings and output weights are shared (reduces parameters)
+ - **KV Cache Support**: Full support for efficient autoregressive generation
+ - **30 Layers**: Deep transformer stack for capacity
+
+ #### 8. **Generate Method** (Text Generation)
+
+ Autoregressive text generation with KV cache:
+
+ ```python
+ @torch.no_grad()
+ def generate(self, input_ids, max_new_tokens=100, temperature=1.0,
+              top_k=None, top_p=None, eos_token_id=None):
+     generated = input_ids
+     past_key_values = None
+
+     for _ in range(max_new_tokens):
+         # Forward pass with KV cache
+         logits, past_key_values = self.forward(
+             generated[:, -1:] if past_key_values else generated,
+             past_key_values=past_key_values,
+             use_cache=True
+         )
+
+         # Sample next token with temperature, top-k, top-p
+         next_token_logits = logits[:, -1, :] / temperature
+         # Apply top-k and top-p filtering (see the sketch below)
+         probs = F.softmax(next_token_logits, dim=-1)
+         next_token = torch.multinomial(probs, num_samples=1)
+
+         generated = torch.cat([generated, next_token], dim=1)
+
+         if eos_token_id is not None and (next_token == eos_token_id).all():
+             break
+
+     return generated
+ ```
+
+ **Key Features:**
+ - **KV Cache**: Only processes new tokens (not the entire sequence)
+ - **Sampling**: Supports temperature, top-k, and top-p (nucleus) sampling (see the filtering sketch below)
+ - **Efficient**: After the initial forward pass, each step runs the model on a single token; attention cost grows only with the cached length
+
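+ The "Apply top-k and top-p filtering" step above is only summarized by a comment; here is a minimal standalone sketch of that filtering (the same idea as the implementation in `app.py`, shown as a helper for illustration):
+
+ ```python
+ import torch
+
+ def filter_logits(logits, top_k=None, top_p=None):
+     """Mask logits so sampling only sees the top-k / nucleus (top-p) tokens."""
+     if top_k is not None and top_k > 0:
+         kth_value = torch.topk(logits, top_k).values[..., -1, None]
+         logits = logits.masked_fill(logits < kth_value, float("-inf"))
+     if top_p is not None and top_p < 1.0:
+         sorted_logits, sorted_idx = torch.sort(logits, descending=True)
+         cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
+         remove = cum_probs > top_p
+         remove[..., 1:] = remove[..., :-1].clone()  # always keep the highest-probability token
+         remove[..., 0] = False
+         logits = logits.masked_fill(remove.scatter(-1, sorted_idx, remove), float("-inf"))
+     return logits
+ ```
+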
+ ### Model Specifications
+
+ | Parameter | Value |
+ |-----------|-------|
+ | **Total Parameters** | ~135M |
+ | **Hidden Size** | 576 |
+ | **Layers** | 30 |
+ | **Attention Heads** | 9 (Q), 3 (K/V) |
+ | **Head Dimension** | 64 |
+ | **Intermediate Size** | 1536 |
+ | **Vocabulary Size** | 49,152 |
+ | **Max Sequence Length** | 8,192 |
+ | **RoPE Theta** | 100,000 |
+ | **Activation** | SwiGLU (SiLU-gated) |
+ | **Normalization** | RMSNorm |
+ | **Weight Tying** | Yes (embeddings = output) |
+
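+ As a sanity check on the ~135M figure, the parameter count can be estimated directly from the values above (a back-of-the-envelope calculation, not the authoritative count from `model.py`):
+
+ ```python
+ hidden, inter, n_layers = 576, 1536, 30
+ vocab, n_heads, n_kv_heads, head_dim = 49152, 9, 3, 64
+
+ embed = vocab * hidden                          # tied with lm_head, so counted once
+ attn = (hidden * n_heads * head_dim             # q_proj
+         + 2 * hidden * n_kv_heads * head_dim    # k_proj, v_proj
+         + n_heads * head_dim * hidden)          # o_proj (no biases)
+ mlp = hidden * (2 * inter) + inter * hidden     # fc1 (gated, 2x width) + fc2
+ norms = 2 * hidden                              # attn_norm + mlp_norm
+
+ total = embed + n_layers * (attn + mlp + norms) + hidden  # + final RMSNorm
+ print(f"{total / 1e6:.1f}M parameters")                   # ~134.5M
+ ```
+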
+ ### Key Design Choices
+
+ 1. **GQA (Grouped Query Attention)**: The 3:1 head ratio cuts K/V-cache memory by roughly two-thirds (see the estimate below)
+ 2. **Pre-norm Architecture**: More stable training than post-norm
+ 3. **RMSNorm**: Faster and simpler than LayerNorm
+ 4. **RoPE**: Relative positional encoding, no learned embeddings
+ 5. **SwiGLU**: Gated activation that typically outperforms ReLU/GELU
+ 6. **Weight Tying**: Reduces parameters and improves generalization
+ 7. **No Biases**: Following LLaMA, reduces parameters slightly
+
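+ For design choice 1, the saving follows directly from the head counts. A rough per-token K/V-cache estimate in bf16 (2 bytes per value), assuming both K and V are cached for every layer:
+
+ ```python
+ n_layers, head_dim, bytes_per_value = 30, 64, 2
+
+ mha_bytes = 2 * 9 * head_dim * n_layers * bytes_per_value  # K and V with 9 heads: 69,120 bytes/token
+ gqa_bytes = 2 * 3 * head_dim * n_layers * bytes_per_value  # K and V with 3 heads: 23,040 bytes/token
+ print(f"saving: {1 - gqa_bytes / mha_bytes:.0%}")          # ~67%
+ ```
+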
+ ### Usage Example
+
+ ```python
+ import torch
+ from transformers import AutoConfig, AutoTokenizer
+
+ from model import SmolConfig, SmolLM2
+
+ # Load config and tokenizer from HuggingFace
+ hf_config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+ config = SmolConfig.from_hf(hf_config)
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+
+ # Create model
+ model = SmolLM2(config)
+
+ # Forward pass (training)
+ input_ids = torch.randint(0, config.vocab_size, (2, 512))
+ logits, _ = model(input_ids, use_cache=False)
+
+ # Text generation (inference with KV cache)
+ prompt_ids = tokenizer("Hello, how are you?", return_tensors="pt").input_ids
+ generated = model.generate(
+     prompt_ids,
+     max_new_tokens=100,
+     temperature=0.8,
+     top_k=50,
+ )
+ print(tokenizer.decode(generated[0], skip_special_tokens=True))
+ ```
+
+ ## Training
+
+ See `README_TRAINING.md` for detailed training instructions.
+
+ ## Inference
+
+ See `app.py` for the Gradio web interface or use the `generate()` method directly.
+
+ ## References
+
+ - [SmolLM2 Paper](https://arxiv.org/abs/2406.02528)
+ - [LLaMA Architecture](https://arxiv.org/abs/2302.13971)
+ - [RoPE: Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
+ - [SwiGLU Activation](https://arxiv.org/abs/2002.05202)
README_SPACE.md ADDED
@@ -0,0 +1,108 @@
+ # SmolLM2-135M Hugging Face Space Setup Guide
+
+ This guide explains how to push your model and app to Hugging Face Spaces.
+
+ ## Files Needed for Hugging Face Space
+
+ 1. **app.py** - Main Gradio application (already created)
+ 2. **model.py** - Model definition
+ 3. **train.py** - Contains SmolLM2Module class (needed for loading checkpoints)
+ 4. **requirements.txt** - Python dependencies
+ 5. **README.md** - Space description (optional but recommended)
+
+ ## Step-by-Step Guide
+
+ ### 1. Fix Merge Conflicts (if still present)
+
+ If you still have merge conflicts, resolve them:
+ ```bash
+ # Check status
+ git status
+
+ # Resolve conflicts in train.py and pyproject.toml
+ # Then commit
+ git add train.py pyproject.toml
+ git commit -m "Resolve merge conflicts"
+ ```
+
+ ### 2. Create Hugging Face Space (if not already created)
+
+ ```bash
+ # Create the space (without --sdk flag, set it in web UI)
+ huggingface-cli repo create smollm2-135m-trained-on-tinyShakespear-forfun --type=space
+ ```
+
+ Then go to the Space settings in the web UI and set:
+ - **SDK**: Gradio
+ - **Python version**: 3.12
+
+ ### 3. Add Hugging Face Remote
+
+ ```bash
+ # Add HF Space as remote (different name to avoid confusion with GitHub)
+ git remote add huggingface https://huggingface.co/spaces/Sualeh77/smollm2-135m-trained-on-tinyShakespear-forfun
+ ```
+
+ ### 4. Prepare Files for Space
+
+ Make sure these files are ready:
+ - ✅ `app.py` - Main app (loads from HF model repo)
+ - ✅ `model.py` - Model definition
+ - ✅ `train.py` - Contains SmolLM2Module
+ - ✅ `requirements.txt` - Dependencies
+ - ✅ `.gitignore` - Should exclude logs/, checkpoints/, etc.
+
+ ### 5. Push to Hugging Face Space
+
+ ```bash
+ # First, disable GPG signing temporarily (if you had issues)
+ git config --global commit.gpgsign false
+
+ # Add and commit files
+ git add app.py model.py train.py requirements.txt .gitignore
+ git commit -m "Add Gradio app for SmolLM2-135M inference"
+
+ # Push to Hugging Face Space
+ git push huggingface main
+
+ # Re-enable GPG signing if you want
+ git config --global commit.gpgsign true
+ ```
+
+ ### 6. Verify on Hugging Face
+
+ 1. Go to your Space: https://huggingface.co/spaces/Sualeh77/smollm2-135m-trained-on-tinyShakespear-forfun
+ 2. Check the "Files" tab - you should see `app.py`, `model.py`, `train.py`, `requirements.txt`
+ 3. The Space should automatically build and deploy
+ 4. Once built, you can test the app in the web interface
+
+ ## Important Notes
+
+ - **Model Loading**: The app automatically loads from the `Sualeh77/smollm2-135m-trained-on-tinyShakespear-forfun` model repo
+ - **Checkpoint**: Uses `smollm2-step=05000-train_loss=0.0918.ckpt`
+ - **First Load**: The first time the Space loads, it will download the checkpoint from the model repo (may take a few minutes)
+ - **Caching**: Subsequent loads will be faster due to Hugging Face caching
+
+ ## Troubleshooting
+
+ ### If push fails with "non-fast-forward":
+ ```bash
+ # Fetch latest
+ git fetch huggingface
+
+ # Rebase (without GPG signing)
+ git config --global commit.gpgsign false
+ git rebase huggingface/main
+ git push huggingface main
+ git config --global commit.gpgsign true
+ ```
+
+ ### If Space build fails:
+ - Check the "Logs" tab in your Space
+ - Ensure all dependencies are in `requirements.txt`
+ - Make sure `app.py` is the entry point (it should be automatically detected)
+
+ ### If model loading fails:
+ - Verify the model repo name is correct: `Sualeh77/smollm2-135m-trained-on-tinyShakespear-forfun`
+ - Verify the checkpoint name: `smollm2-step=05000-train_loss=0.0918.ckpt`
+ - Check that the checkpoint file exists in the model repo
app.py ADDED
@@ -0,0 +1,259 @@
+ """
+ Gradio app for SmolLM2-135M inference with streaming output.
+ Loads model from Hugging Face model repo.
+ """
+
+ import sys
+ from pathlib import Path
+ from typing import List, Optional
+ import os
+
+ import gradio as gr
+ import torch
+ from transformers import AutoConfig, AutoTokenizer
+ from huggingface_hub import hf_hub_download
+
+ from model import SmolConfig, SmolLM2
+ from train import SmolLM2Module
+
+ # Hugging Face model repo configuration
+ HF_MODEL_REPO = "Sualeh77/smollm2-135m-trained-on-tinyShakespear-forfun"
+ CHECKPOINT_NAME = "smollm2-step=05000-train_loss=0.0918.ckpt"
+
+ # Device setup
+ DEVICE = "cpu"
+ if torch.cuda.is_available():
+     DEVICE = "cuda"
+ elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+     DEVICE = "mps"
+
+ # Globals
+ model: Optional[SmolLM2] = None
+ tokenizer = None
+
+ # Allow SmolConfig to be deserialized from Lightning checkpoints when torch.load runs in weights-only mode
+ try:
+     torch.serialization.add_safe_globals([SmolConfig])  # type: ignore[attr-defined]
+ except Exception:
+     pass
+
+
+ def load_model_checkpoint(checkpoint_path: Optional[str] = None, use_hf: bool = True):
+     """Load Lightning checkpoint from Hugging Face Hub or local path."""
+     global model, tokenizer
+
+     try:
+         # Load tokenizer and config from Hugging Face
+         hf_cfg = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+         config = SmolConfig.from_hf(hf_cfg)
+         tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+         if tokenizer.pad_token is None:
+             tokenizer.pad_token = tokenizer.eos_token
+
+         # Determine checkpoint path
+         if use_hf and checkpoint_path is None:
+             # Download from Hugging Face Hub
+             try:
+                 local_ckpt = hf_hub_download(
+                     repo_id=HF_MODEL_REPO,
+                     filename=CHECKPOINT_NAME,
+                     cache_dir=None,  # Use default cache
+                 )
+                 checkpoint_path = local_ckpt
+                 status_msg = f"✅ Model loaded from Hugging Face: {HF_MODEL_REPO}/{CHECKPOINT_NAME}"
+             except Exception as e:
+                 return f"❌ Failed to download from HF Hub: {e}"
+         elif checkpoint_path:
+             # Use local path
+             ckpt = Path(checkpoint_path)
+             if not ckpt.exists():
+                 return f"❌ Checkpoint not found: {ckpt}"
+             status_msg = f"✅ Model loaded from local path: {checkpoint_path}"
+         else:
+             return "❌ No checkpoint path provided"
+
+         # Load the Lightning module
+         module = SmolLM2Module.load_from_checkpoint(
+             str(checkpoint_path),
+             config=config,
+             tokenizer=tokenizer,
+             map_location=DEVICE,
+             strict=False,
+         )
+         module.eval()
+         model = module.model.to(DEVICE).eval()
+         return f"{status_msg} on {DEVICE}"
+     except Exception as e:
+         model = None
+         return f"❌ Error loading model: {e}"
+
+
+ def stream_generate(
+     prompt: str,
+     max_new_tokens: int,
+     temperature: float,
+     top_k: int,
+     top_p: float,
+ ):
+     """Generator that yields only the generated text (without prompt)."""
+     global model, tokenizer
+     if model is None or tokenizer is None:
+         yield "⚠️ Load the model first (click Reload Model)."
+         return
+
+     if not prompt or not prompt.strip():
+         yield "⚠️ Please enter a prompt."
+         return
+
+     # Tokenize prompt
+     inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
+     input_ids = inputs["input_ids"].to(DEVICE)
+
+     # Guard against context overflow
+     if input_ids.shape[1] >= model.config.max_position_embeddings:
+         yield f"⚠️ Prompt too long ({input_ids.shape[1]} tokens). Max is {model.config.max_position_embeddings}."
+         return
+
+     generated = input_ids
+     past_key_values: Optional[List] = None
+     prompt_length = input_ids.shape[1]
+
+     with torch.no_grad():
+         for _ in range(max_new_tokens):
+             if past_key_values is None:
+                 current_input = generated
+             else:
+                 current_input = generated[:, -1:]
+
+             logits, past_key_values = model(
+                 current_input,
+                 past_key_values=past_key_values,
+                 use_cache=True,
+             )
+
+             next_token_logits = logits[:, -1, :] / max(temperature, 1e-6)
+
+             # top-k
+             if top_k > 0:
+                 values, _ = torch.topk(next_token_logits, top_k)
+                 min_keep = values[:, -1].unsqueeze(-1)
+                 next_token_logits = torch.where(
+                     next_token_logits < min_keep,
+                     torch.full_like(next_token_logits, float("-inf")),
+                     next_token_logits,
+                 )
+
+             # top-p
+             if top_p < 1.0:
+                 sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
+                 probs = torch.softmax(sorted_logits, dim=-1)
+                 cumulative = torch.cumsum(probs, dim=-1)
+                 sorted_mask = cumulative > top_p
+                 sorted_mask[..., 1:] = sorted_mask[..., :-1].clone()
+                 sorted_mask[..., 0] = 0
+                 mask = sorted_mask.scatter(1, sorted_indices, sorted_mask)
+                 next_token_logits = torch.where(mask, torch.full_like(next_token_logits, float("-inf")), next_token_logits)
+
+             probs = torch.softmax(next_token_logits, dim=-1)
+             next_token = torch.multinomial(probs, num_samples=1)
+
+             generated = torch.cat([generated, next_token], dim=1)
+             # Decode only the generated part (skip the prompt)
+             generated_text = tokenizer.decode(generated[0][prompt_length:], skip_special_tokens=True)
+             yield generated_text
+
+
+ # Initial load from Hugging Face
+ INITIAL_STATUS = load_model_checkpoint(use_hf=True)
+
+
+ def chat_stream(message, history, max_tokens, temperature, top_k, top_p):
+     """Gradio wrapper for streaming chat."""
+     if history is None:
+         history = []
+
+     # Convert history from tuple format to dict format if needed
+     if history and isinstance(history[0], (list, tuple)):
+         new_history = []
+         for h in history:
+             if isinstance(h, (list, tuple)) and len(h) >= 2:
+                 if h[0]:  # User message
+                     new_history.append({"role": "user", "content": str(h[0])})
+                 if h[1]:  # Assistant message
+                     new_history.append({"role": "assistant", "content": str(h[1])})
+         history = new_history
+
+     # Append user message
+     user_msg = (message or "").strip()
+     if not user_msg:
+         yield history
+         return
+
+     history.append({"role": "user", "content": user_msg})
+     history.append({"role": "assistant", "content": ""})
+
+     stream = stream_generate(user_msg, max_tokens, temperature, top_k, top_p)
+     for partial in stream:
+         # Update the last assistant message with generated text
+         if partial:
+             history[-1] = {"role": "assistant", "content": str(partial)}
+         yield history
+
+
+ def clear_chat():
+     return "", []
+
+
+ with gr.Blocks(title="SmolLM2-135M Text Generator") as demo:
+     gr.Markdown(
+         """
+ # 🤖 SmolLM2-135M Text Generator
+
+ Generate text with your trained SmolLM2-135M model (streaming output).
+
+ **Model:** Trained on TinyShakespeare dataset
+ **Source:** [Hugging Face Model Repo](https://huggingface.co/Sualeh77/smollm2-135m-trained-on-tinyShakespear-forfun)
+ """
+     )
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             gr.Markdown("### Model Status")
+             status_text = gr.Textbox(value=INITIAL_STATUS, label="Status", interactive=False, lines=3)
+             load_btn = gr.Button("🔄 Reload Model from HF", variant="secondary")
+             load_btn.click(fn=lambda: load_model_checkpoint(use_hf=True), outputs=status_text)
+
+             gr.Markdown("### Local Checkpoint (Optional)")
+             ckpt_input = gr.Textbox(
+                 value="",
+                 label="Local checkpoint path (leave empty to use HF)",
+                 interactive=True,
+             )
+             load_local_btn = gr.Button("📁 Load from Local Path", variant="secondary")
+             load_local_btn.click(
+                 fn=lambda p: load_model_checkpoint(checkpoint_path=p, use_hf=False) if p else "⚠️ Please enter a path",
+                 inputs=ckpt_input,
+                 outputs=status_text
+             )
+
+             gr.Markdown("### Generation Parameters")
+             max_tokens = gr.Slider(10, 500, value=100, step=10, label="Max Tokens")
+             temperature = gr.Slider(0.1, 2.0, value=0.8, step=0.1, label="Temperature")
+             top_k = gr.Slider(0, 100, value=50, step=5, label="Top-K")
+             top_p = gr.Slider(0.1, 1.0, value=1.0, step=0.05, label="Top-P")
+
+         with gr.Column(scale=2):
+             gr.Markdown("### 💬 Chat Interface")
+             chatbot = gr.Chatbot(label="Conversation", height=500)
+             with gr.Row():
+                 msg = gr.Textbox(label="Your Message", placeholder="Type your prompt here...", scale=4, lines=2)
+                 submit_btn = gr.Button("Send ➀", variant="primary", scale=1)
+             clear_btn = gr.Button("🗑️ Clear Chat", variant="stop")
+
+     msg.submit(fn=chat_stream, inputs=[msg, chatbot, max_tokens, temperature, top_k, top_p], outputs=chatbot)
+     submit_btn.click(fn=chat_stream, inputs=[msg, chatbot, max_tokens, temperature, top_k, top_p], outputs=chatbot).then(fn=lambda: "", outputs=msg)
+     clear_btn.click(fn=clear_chat, outputs=[msg, chatbot])
+
+
+ if __name__ == "__main__":
+     demo.queue().launch(share=False, server_name="0.0.0.0", server_port=7860)
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ torch>=2.9.1
+ lightning>=2.6.0
+ transformers>=4.57.3
+ gradio>=4.44.0
+ huggingface-hub>=0.20.0