# Semantic Explorer - Project Documentation

**Project Type:** Gradio Web Application
**Target Platform:** Hugging Face Spaces (Gradio SDK)
**Development Environment:** Ubuntu Linux with Cursor IDE
**Python Version:** 3.12
**Primary Developer:** Josh (jnalv)

---

## Project Overview

### What This App Does

Semantic Explorer is an interactive web application that lets users explore semantic relationships between words using multiple embedding models. Users can:

1. **Compare embeddings across models** - Select from 2 embedding models to see how each represents semantic meaning
2. **Find similar and dissimilar words** - Discover the 20 most and 20 least semantically similar words to a reference word
3. **Compare words directly** - Get similarity scores between a reference word and up to 10 comparison words

### Core Functionality

The app operates in two phases:

**Phase 1: Preparation (Local, one-time per model)**
- User runs `prepare_embeddings.py` with a word list and chosen embedding model
- Script generates embeddings for all words (typically 100k+ words)
- Embeddings are stored in a ChromaDB vector database
- Each model gets its own collection (e.g., `words_bge_large_en_v1_5`)

**Phase 2: Exploration (Web interface)**
- User launches the Gradio app via `app.py` (deployed on HuggingFace Spaces)
- Selects an embedding model via radio buttons
- Models and collections lazy-load on demand (cached after first load)
- Performs similarity searches and comparisons in real time

### Primary Use Case

Educational and research tool for understanding how different embedding models capture semantic meaning. Allows side-by-side comparison of model behaviors when representing language.

---

## Project Goals

### Immediate Goals

1. **Hugging Face Spaces Deployment** ✅ COMPLETE
   - App deployed and live on HuggingFace Spaces
   - 2 embedding models integrated and tested (all-MiniLM-L6-v2, BGE Large EN v1.5)
   - Models load directly from HuggingFace Hub
   - ChromaDB databases pre-computed and uploaded
   - Lazy loading and caching implemented for optimal performance

2. **User Experience Polish** ✅ COMPLETE
   - Streamlined 2-tab interface (Comparison Tool, Most & Least Similar)
   - Radio buttons for model selection
   - Fixed 20 results in Most & Least Similar (no slider)
   - Clear model labels in all results
   - Mobile-responsive design

### Long-Term Goals

- Support for custom word lists via file upload
- Visualization of embedding spaces (t-SNE/UMAP)
- Export similarity results as CSV
- Multi-language support beyond English
- API endpoint for programmatic access

---

## Technical Architecture

### Technology Stack

**Core Dependencies:**

```
gradio>=4.0.0                 # Web interface framework
chromadb>=0.4.0               # Vector database
sentence-transformers>=2.2.0  # Embedding model interface
numpy>=1.24.0                 # Numerical operations
torch>=2.0.0                  # PyTorch backend
einops>=0.6.0                 # Tensor operations (Jina v4)
peft>=0.7.0                   # LoRA adapters (Jina v4)
tqdm>=4.65.0                  # Progress bars
```

**Hardware Requirements:**
- GPU: NVIDIA RTX 4060 Ti 16GB VRAM (development system)
- RAM: 32GB DDR4
- Storage: ~500MB per model + ~1GB per embedded word list

### Embedding Models (Currently Deployed)

| Model | Params | Dimensions | VRAM | Special Requirements |
|-------|--------|------------|------|----------------------|
| all-MiniLM-L6-v2 | 22M | 384 | ~0.5GB | None |
| BGE Large EN v1.5 | 335M | 1024 | ~2GB | None |

**Model-Specific Notes:**
- Both models load directly from HuggingFace Hub
- All models use the sentence-transformers library for a consistent interface
- Models are lazy-loaded and cached for performance

### Data Flow

```
User Input (reference word)
    ↓
Model Selection (radio buttons)
    ↓
Lazy Load Model + Collection (if not cached)
    ↓
Encode Reference Word
    ↓
Vector Similarity Search (ChromaDB or numpy)
    ↓
Format Results with Model Label
    ↓
Display in Gradio Interface
```

### File Structure

```
/home/jnalv/llm-projects/semantic-explorer/
├── app.py                    # Main Gradio interface (MOST IMPORTANT)
├── prepare_embeddings.py     # One-time embedding generation
├── requirements.txt          # Python dependencies
├── README.md                 # User-facing documentation
├── QUICKSTART.md             # Getting started guide
├── DEPLOYMENT.md             # HuggingFace Spaces deployment
├── CHANGES.md                # Changelog
├── documents/                # Word lists (input data)
│   └── english-words.txt     # ~100k English words
├── models/                   # Shared embedding models (symlink to /home/jnalv/llm-projects/models)
├── chromadb/                 # Vector databases (one per model)
│   ├── words_all_MiniLM_L6_v2/
│   ├── words_bge_large_en_v1_5/
│   ├── words_jina_embeddings_v4/
│   ├── words_Qwen3_Embedding_0_6B/
│   ├── words_embeddinggemma_300m/
│   └── words_nomic_embed_text_v1_5/
└── [documentation files]
```

---

## Key Implementation Details

### Model Registry Pattern

Models are defined in a central registry (the `AVAILABLE_MODELS` dict in `app.py`):

```python
AVAILABLE_MODELS = {
    "model-key": {
        "name": "model-key",
        "display": "Model Name: X dimensions",  # Shown in the model selector
        "hf_id": "org/model-name",              # HuggingFace model ID
        "dimensions": 1024,                     # Embedding dimensions
        "trust_remote_code": False,             # Security flag
        "default_task": "retrieval"             # Optional: for task-specific models
    }
}
```

**Adding a new model requires:**
1. Entry in `AVAILABLE_MODELS` (app.py)
2. Entry in `MODELS_REQUIRING_TRUST` if needed (prepare_embeddings.py)
3. Entry in `MODELS_REQUIRING_TASK` if needed (prepare_embeddings.py)
4. Running `prepare_embeddings.py` to create its collection

### Collection Naming Convention

Collections are auto-generated from model keys:
- Pattern: `words_{model_key}` with special chars replaced by underscores
- Example: `jinaai/jina-embeddings-v4` → `words_jina_embeddings_v4`
- Implementation: `get_collection_name()` function

### Lazy Loading Strategy

Models and collections are only loaded when first selected:

```python
loaded_models = {}       # Cache: model_key → SentenceTransformer
loaded_collections = {}  # Cache: model_key → ChromaDB Collection

def load_model_and_collection(model_key):
    if model_key in loaded_models:
        return loaded_models[model_key], loaded_collections[model_key]
    # ... load and cache
```

**Benefits:**
- Fast app startup (no models loaded initially)
- Memory efficient (only load what's used)
- Smooth model switching (cached after first load)

### Similarity Computation

**Most/Least Similar:**
- Retrieves all embeddings from the collection
- Computes cosine similarity with numpy
- Sorts and returns top n / bottom n

**Target Score** (legacy — the Target Score tab has been removed from the UI):
- Finds words within tolerance of a target score
- If not enough exact matches, shows "next-nearest"
- Random sampling if too many exact matches

**Comparison Tool:**
- Encodes reference + comparison words in a single batch
- Computes pairwise similarities
- Sorts by similarity (descending)

---

## Critical Code Patterns

### Trust Remote Code Handling

Some models require custom code execution:

```python
# In model registry
"trust_remote_code": True

# In loading code
model = SentenceTransformer(
    model_id,
    trust_remote_code=model_info.get("trust_remote_code", False)
)
```

**Models requiring this:** Jina v4, Nomic v1.5

### Task-Specific Model Handling

Jina v4 uses LoRA adapters and requires a task:

```python
# In model registry
"default_task": "retrieval"

# In loading code
model_kwargs = {}
if "default_task" in model_info:
    model_kwargs["default_task"] = model_info["default_task"]
model = SentenceTransformer(
    model_id,
    model_kwargs=model_kwargs if model_kwargs else None
)
```

### Gradio Tab Structure

The interface uses tabs for different modes:

```python
with gr.Blocks() as app:
    model_selector = gr.Radio(choices=[...])  # Top-level selector with radio buttons
    with gr.Tabs():
        with gr.Tab("Comparison Tool"):       # Tab 1
            ...  # comparison UI (up to 10 words)
        with gr.Tab("Most & Least Similar"):  # Tab 2
            ...  # shows 20 most + 20 least similar words
```

**Design rationale:**
- Comparison Tool first (most direct use case)
- Combined Most & Least (better for mobile, shows contrast)
- Target Score removed (didn't function as needed)
- Radio buttons for model selection (cleaner with only 2 models)
- Fixed 20 results in Most & Least (no slider needed)

---

## Development History

### Model Evolution

**Removed Models:**
- E5 Large v2 → "clumpy" semantic clustering
- Qwen2.5 Embedding 7B → OOM errors (14GB VRAM)
- Jina Embeddings v3 → Superseded by v4

**Current Models (2 deployed on HF Spaces):**
- all-MiniLM-L6-v2: Fast, lightweight (384 dimensions)
- BGE Large EN v1.5: Larger, more capable (1024 dimensions)
- Both models chosen for quality and HF Spaces compatibility

### Interface Evolution

**Original Design:**
- 5 separate tabs (Most Similar, Least Similar, Target Score, Target Range, Comparison)
- Model name shown at bottom
- Dropdown for model selection

**Previous Design (6 models):**
- 3 tabs (Comparison, Most & Least, Target Score)
- Model dropdown at top
- Slider for result count

**Current Design (Production on HF Spaces):**
- 2 tabs (Comparison, Most & Least Similar)
- Radio buttons for model selection
- Fixed 20 results in Most & Least Similar
- Model name shown in each result
- Streamlined, mobile-optimized interface

---

## Deployment Status (HuggingFace Spaces)

### Current Deployment ✅

**Live on HuggingFace Spaces:**
- URL: hf.co/spaces/jnalv/semantic-explorer
- 2 models deployed: all-MiniLM-L6-v2, BGE Large EN v1.5
- ChromaDB databases pre-computed and uploaded
- Models load directly from HuggingFace Hub (no local paths)
- Lazy loading implemented for optimal cold-start performance

**Deployment Configuration:**
1. `app.py` - Main application file (HF Spaces runs this automatically)
2. Models selected for HF Spaces storage limits
3. Embeddings pre-computed locally and uploaded
4. ChromaDB databases included in the repository

**README.md metadata (in production):**

```yaml
---
title: Semantic Word Explorer
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
---
```

### Deployment Notes

**Storage:**
- ChromaDB databases (~1GB each for 2 models)
- Managed within HF Spaces storage limits

**Performance:**
- Models lazy-load on first selection
- Cached after first load for fast switching
- Cold start: ~30-60 seconds per model (acceptable for free tier)

**Resource Management:**
- 2 models fit comfortably in HF Spaces memory
- Lazy loading ensures efficient resource usage

---

## Developer Notes

### Josh's Preferences

**Code style:**
- Practical over theoretical
- Working prototypes over perfect architecture
- Remove features that don't add clear value
- Streamlined, mobile-friendly interfaces
- No excessive explanation in the UI

**Development approach:**
- Iterative refinement based on testing
- Build complete prototypes to understand concepts
- Modular structure (separate projects, shared resources)
- Git hygiene with virtual environments

### Testing Workflow

```bash
# Local testing
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer

# Prepare embeddings for a model
python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model BAAI/bge-large-en-v1.5

# Launch app
python app.py
# Access at http://localhost:7860
```

### Common Issues & Solutions

**Issue:** Model not loading
**Solution:** Check that `prepare_embeddings.py` was run for that model

**Issue:** Collection not found
**Solution:** Collection names are auto-generated; verify with ChromaDB

**Issue:** OOM during embedding generation
**Solution:** Reduce the `--batch-size` parameter

**Issue:** Jina v4 task error
**Solution:** Ensure `default_task` is set in the model registry

---

## AI Agent Instructions

When working with this codebase:

1. **Adding new models:** Update `AVAILABLE_MODELS` in app.py; add trust/task requirements if needed
2. **Modifying UI:** Edit Gradio components in the `create_interface()` function
3. **Changing similarity logic:** Modify functions in app.py (`find_most_and_least_similar`, etc.)
4. **Deployment prep:** Focus on `app_hf_spaces.py` and adapt for the Spaces environment
5. **Documentation:** Update relevant .md files when making significant changes

### Code Modification Guidelines

- Preserve the lazy loading pattern (don't load all models at startup)
- Maintain the model registry structure (central source of truth)
- Keep model labels in results (user feedback requirement)
- Test with multiple models before committing
- Update requirements.txt for new dependencies

### Files You'll Most Often Edit

1. `app.py` - Main application logic (deployed to HF Spaces)
2. `prepare_embeddings.py` - Embedding generation (local only, for adding new models)
3. `requirements.txt` - Dependencies
4. `README.md` / `QUICKSTART.md` - User documentation

### Important Files

**Production:**
- `app.py` - Single app file (loads models from HuggingFace Hub)
- `chromadb/` - Pre-computed embeddings for deployed models
- `requirements.txt` - Dependencies for HF Spaces

**Local Development:**
- `prepare_embeddings.py` - Generate embeddings locally before deployment
- `documents/english-words.txt` - Word list (~100k words)

### Files You Shouldn't Usually Edit

- `chromadb/` - Generated data
- `documents/` - Input data
- `models/` - Downloaded models
- Backup files (*.backup)

---

## Quick Reference

### Run Locally

```bash
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer
python app.py
```

### Prepare New Model

```bash
python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model <hf-model-id>
```

### Check Available Models

```bash
# Look at AVAILABLE_MODELS in app.py, or run:
python -c "import app; print(list(app.AVAILABLE_MODELS.keys()))"
```

### Test a Specific Model

Launch the app, select a model via the radio buttons, and test each tab.

---

## Related Documentation

- `QUICKSTART.md` - Getting started guide
- `README.md` - Comprehensive user documentation
- `DEPLOYMENT.md` - HuggingFace Spaces deployment (original)
- `HUGGINGFACE_SPACES_WORKFLOW.md` - Git workflow for Spaces
- `CHANGES.md` - Change history
- `NEW_MODELS.md` - Latest model additions
- `JINA_V4_TASK_FIX.md` - Jina v4 task requirement fix

---

## Version Information

**Current State:** Live on HuggingFace Spaces
**Deployment Date:** January 2025
**Last Major Update:** January 2025 - Streamlined to 2 tabs, radio buttons, 2 models
**Python Version:** 3.12
**Gradio Version:** 4.0.0+
**Deployed Models:** 2 embedding models (384 and 1024 dimensions)

---

**END OF PROJECT DOCUMENTATION**
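
---

**Addendum — collection naming sketch.** The Collection Naming Convention section above describes `get_collection_name()` but does not show its body (the real implementation lives in `app.py`). A minimal sketch, assuming the sanitization implied by the documented examples (org prefix dropped, every non-alphanumeric character replaced with an underscore); the function name matches the doc, but the body here is an illustration, not the shipped code:

```python
import re

def get_collection_name(model_key: str) -> str:
    """Derive a ChromaDB collection name from a model key.

    Sketch based on the documented pattern `words_{model_key}`:
    drop any HuggingFace org prefix, then replace special
    characters with underscores.
    """
    base = model_key.split("/")[-1]  # e.g. "jinaai/jina-embeddings-v4" -> "jina-embeddings-v4"
    return "words_" + re.sub(r"[^A-Za-z0-9]", "_", base)

print(get_collection_name("jinaai/jina-embeddings-v4"))  # words_jina_embeddings_v4
print(get_collection_name("BAAI/bge-large-en-v1.5"))     # words_bge_large_en_v1_5
```

Both outputs match the collection directory names listed in the File Structure section.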
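
**Addendum — Most/Least Similar sketch.** The Similarity Computation section above says the Most & Least Similar tab computes cosine similarity against all stored embeddings with numpy, then sorts and returns the top n and bottom n. A self-contained sketch of that approach; the function name and signature are hypothetical (the app's actual function is `find_most_and_least_similar` in `app.py`):

```python
import numpy as np

def rank_most_and_least_similar(ref_vec, embeddings, words, n=20):
    """Rank every word by cosine similarity to a reference vector
    and return the n most and n least similar (word, score) pairs."""
    emb = np.asarray(embeddings, dtype=np.float64)
    ref = np.asarray(ref_vec, dtype=np.float64)
    # Cosine similarity: dot product divided by the product of norms
    sims = emb @ ref / (np.linalg.norm(emb, axis=1) * np.linalg.norm(ref))
    order = np.argsort(sims)[::-1]  # indices sorted by descending similarity
    most = [(words[i], float(sims[i])) for i in order[:n]]
    least = [(words[i], float(sims[i])) for i in order[-n:]]
    return most, least

# Toy example with 2-dimensional "embeddings"
words = ["same", "orthogonal", "opposite"]
vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
most, least = rank_most_and_least_similar([1.0, 0.0], vecs, words, n=1)
print(most)   # [('same', 1.0)]
print(least)  # [('opposite', -1.0)]
```

In the app the full embedding matrix comes from the model's ChromaDB collection, so this ranking is a single vectorized pass rather than a per-word loop.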