---
title: LLM Decoding Strategy Analyzer
emoji: 🔬
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 6.5.0
app_file: app.py
pinned: false
license: mit
---

# 🔬 LLM Decoding Strategy Analyzer

An interactive tool to compare 5 text generation decoding strategies side-by-side using GPT-2.

## 🎯 Purpose

This project demonstrates **how different decoding strategies affect text generation quality**. Rather than just showing that language models can generate text, it explains *why* the outputs look the way they do based on the underlying algorithms.

**Target Audience:** AI/ML practitioners, researchers, and anyone interested in understanding LLM text generation mechanics.

## 🏗️ Architecture

```
User Prompt → GPT-2 Tokenizer → GPT-2 Model (124M params)
                        ↓
      ┌─────────────────┴─────────────────┐
      ↓        ↓        ↓        ↓        ↓
   Greedy    Beam     Top-K    Top-P  Temp+Top-P
      ↓        ↓        ↓        ↓        ↓
      └─────────────────┬─────────────────┘
                        ↓
            Side-by-Side Comparison UI
```

## 📊 Decoding Strategies Explained

### 1. Greedy Decoding
- **How it works:** Always selects the token with the highest probability
- **Parameters:** `do_sample=False`
- **Behavior:** Deterministic; the same input always produces the same output
- **Problem:** Tends to produce repetitive, "safe" text that often loops

### 2. Beam Search
- **How it works:** Maintains the top `num_beams` hypotheses at each step, exploring multiple paths simultaneously
- **Parameters:** `num_beams=5, no_repeat_ngram_size=2`
- **Behavior:** Deterministic, but explores more possibilities than greedy
- **Problem:** Still conservative; produces coherent but predictable text

### 3. Top-K Sampling
- **How it works:** Samples randomly from the K most likely tokens
- **Parameters:** `top_k=50, temperature=1.0`
- **Behavior:** Stochastic; adds variety to outputs
- **Problem:** A fixed K doesn't adapt to the shape of the probability distribution

### 4. Top-P (Nucleus) Sampling
- **How it works:** Samples from the smallest set of tokens whose cumulative probability exceeds P
- **Parameters:** `top_p=0.95, temperature=1.0`
- **Behavior:** Adapts to the distribution's shape; uses fewer tokens when the model is confident
- **Advantage:** Best balance of creativity and coherence
- **Reference:** Holtzman et al. (2019)

### 5. Temperature + Top-P
- **How it works:** Scales logits by temperature before applying top-p sampling
- **Parameters:** `temperature=0.7, top_p=0.95`
- **Behavior:** Lower temperature sharpens the distribution; combined with top-p for quality
- **Advantage:** Fine-grained control over creativity vs. focus

## 📚 Research Foundation

This project is grounded in academic research on neural text generation:

| Paper | Authors | Year | Key Contribution |
|-------|---------|------|------------------|
| [The Curious Case of Neural Text Degeneration](https://arxiv.org/abs/1904.09751) | Holtzman et al. | 2019 | Introduced nucleus (top-p) sampling; explained why maximization-based decoding fails |
| [If beam search is the answer, what was the question?](https://arxiv.org/abs/2010.02650) | Meister et al. | 2020 | Analyzed why beam search works despite high search error; uniform information density |
| [Mirostat: A Neural Text Decoding Algorithm](https://arxiv.org/abs/2007.14966) | Basu et al. | 2020 | Mathematical analysis of top-k, top-p, and temperature; identified the "boredom trap" and "confusion trap" |
| [Generative AI-Based Text Generation Methods Using Pre-Trained GPT-2 Model](https://arxiv.org/abs/2404.01786) | Pandey et al. | 2024 | Comprehensive comparison of decoding strategies on GPT-2 |

## 🔍 Key Observations

From testing these strategies, we consistently observe:

| Strategy | Repetition Risk | Creativity | Coherence | Recommended Use |
|----------|-----------------|------------|-----------|-----------------|
| Greedy | ❌ High | ❌ Low | ⚠️ Degrades | Baseline comparison only |
| Beam Search | ✅ Low | ❌ Low | ✅ High | Factual/structured tasks |
| Top-K | ✅ Low | ✅ High | ⚠️ Variable | Creative exploration |
| Top-P | ✅ Low | ✅ High | ✅ High | General purpose |
| Temp + Top-P | ✅ Low | ✅ Moderate | ✅ High | Production applications |

**Example of Greedy Degeneration:**

```
Input: "In a distant galaxy, a lone astronaut discovered"
Output: "...The astronaut was able to see the planet's gravity. The astronaut
was able to see the planet's gravity. The astronaut was able to see the
planet's gravity..."
```

This repetition loop is the "text degeneration" phenomenon identified by Holtzman et al.

## 🛠️ Technical Implementation

| Component | Choice | Rationale |
|-----------|--------|-----------|
| Model | GPT-2 (124M) | Large enough for quality differences; small enough for CPU inference |
| Framework | Hugging Face Transformers | Industry standard; well-documented |
| Interface | Gradio | Quick deployment; interactive comparison |
| Timing | `torch.cuda.synchronize()` | Accurate GPU timing measurement |

## 🧪 Development Challenges

### Challenge 1: Gradio Theme Deprecation

**Problem:** `DeprecationWarning: The 'theme' parameter in Blocks constructor will be removed in Gradio 6.0`

**Investigation:** Gradio 5.x/6.x changed where the theme parameter should be placed.

**Solution:** Kept the theme in the constructor for compatibility with the current SDK version; documented for future migration.
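## 💡 How the Sampling Cutoffs Work

The difference between top-k's fixed cutoff and top-p's adaptive one can be sketched in a few lines of plain Python. This is an illustrative re-implementation of the filtering step only, not the code in `app.py` (which delegates sampling to Transformers' `model.generate`); the function names `top_p_filter` and `top_k_filter` are hypothetical.

```python
import math

def top_p_filter(logits, p=0.95):
    """Keep the smallest set of token indices whose cumulative probability
    exceeds p (nucleus filtering, Holtzman et al. 2019)."""
    # Softmax over the raw scores (shift by max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # nucleus reached: stop adding tokens
            break
    return kept

def top_k_filter(logits, k=50):
    """Keep the indices of the k highest-scoring tokens (fixed-size cutoff)."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]

# A peaked distribution: the nucleus shrinks when the model is confident,
# while top-k always keeps exactly k candidates regardless of shape.
peaked = [10.0, 2.0, 1.0, 0.5, 0.1]
print(top_p_filter(peaked, p=0.95))
print(top_k_filter(peaked, k=3))
```

With the peaked scores above, nucleus filtering keeps only the dominant token, while top-k still keeps three; on a flat distribution the nucleus grows instead. This adaptivity is why top-p is listed as the general-purpose choice in the comparison table.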
## 📁 Repository Structure

```
llm-decoding-strategies/
├── app.py             # Main application with all strategies
├── requirements.txt   # Python dependencies
└── README.md          # This documentation
```

## 🚀 Local Development

```bash
# Clone the repository
git clone https://huggingface.co/spaces/Nav772/llm-decoding-strategies

# Install dependencies
pip install -r requirements.txt

# Run locally
python app.py
```

## 📝 Limitations

- **Model Size:** GPT-2 (124M) is relatively small by modern standards; larger models would show more nuanced differences
- **CPU Inference:** Running on the CPU free tier; generation is slower than on GPU
- **Context Length:** Limited to 1024 tokens (the GPT-2 maximum)
- **English Only:** GPT-2 was primarily trained on English text

## 👤 Author

**[Nav772](https://huggingface.co/Nav772)** - Built as part of an AI Engineering portfolio demonstrating understanding of LLM text generation mechanics.

## 📚 Related Projects

- [RAG Document Q&A](https://huggingface.co/spaces/Nav772/rag-document-qa) - Retrieval-Augmented Generation system
- [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer) - NLP classification
- [Food Image Classifier](https://huggingface.co/spaces/Nav772/food-image-classifier) - Computer vision

## 📄 License

MIT License - See LICENSE file for details.