
Semantic Explorer - Project Documentation

Project Type: Gradio Web Application
Target Platform: Hugging Face Spaces (Gradio SDK)
Development Environment: Ubuntu Linux with Cursor IDE
Python Version: 3.12
Primary Developer: Josh (jnalv)


Project Overview

What This App Does

Semantic Explorer is an interactive web application that allows users to explore semantic relationships between words using multiple embedding models. Users can:

  1. Compare embeddings across models - Select from 2 different embedding models to see how they represent semantic meaning
  2. Find similar and dissimilar words - Discover the 20 most and 20 least semantically similar words to a reference word
  3. Compare words directly - Get similarity scores between a reference word and up to 10 comparison words

Core Functionality

The app operates in two phases:

Phase 1: Preparation (Local, one-time per model)

  • User runs prepare_embeddings.py with a word list and chosen embedding model
  • Script generates embeddings for all words (typically 100k+ words)
  • Embeddings stored in ChromaDB vector database
  • Each model gets its own collection (e.g., words_bge_large_en_v1_5)

Phase 2: Exploration (Web interface)

  • User launches Gradio app via app.py (deployed on HuggingFace Spaces)
  • Selects embedding model from radio buttons
  • Models and collections lazy-load on demand (cached after first load)
  • Performs similarity searches and comparisons in real-time

Primary Use Case

Educational and research tool for understanding how different embedding models capture semantic meaning. Allows side-by-side comparison of model behaviors when representing language.


Project Goals

Immediate Goals

  1. Hugging Face Spaces Deployment ✅ COMPLETE

    • App deployed and live on HuggingFace Spaces
    • 2 embedding models integrated and tested (all-MiniLM-L6-v2, BGE Large EN v1.5)
    • Models load directly from HuggingFace Hub
    • ChromaDB databases pre-computed and uploaded
    • Lazy loading and caching implemented for optimal performance
  2. User Experience Polish ✅ COMPLETE

    • Streamlined 2-tab interface (Comparison Tool, Most & Least Similar)
    • Radio buttons for model selection
    • Fixed 20 results in Most & Least Similar (no slider)
    • Clear model labels in all results
    • Mobile-responsive design

Long-Term Goals

  • Support for custom word lists via file upload
  • Visualization of embedding spaces (t-SNE/UMAP)
  • Export similarity results as CSV
  • Multi-language support beyond English
  • API endpoint for programmatic access

Technical Architecture

Technology Stack

Core Dependencies:

gradio>=4.0.0              # Web interface framework
chromadb>=0.4.0            # Vector database
sentence-transformers>=2.2.0  # Embedding model interface
numpy>=1.24.0              # Numerical operations
torch>=2.0.0               # PyTorch backend
einops>=0.6.0              # Tensor operations (Jina v4)
peft>=0.7.0                # LoRA adapters (Jina v4)
tqdm>=4.65.0               # Progress bars

Hardware Requirements:

  • GPU: NVIDIA RTX 4060Ti 16GB VRAM (development system)
  • RAM: 32GB DDR4
  • Storage: ~500MB per model + ~1GB per embedded word list

Embedding Models (Currently Deployed)

Model               Params   Dimensions   VRAM     Special Requirements
all-MiniLM-L6-v2    22M      384          ~0.5GB   None
BGE Large EN v1.5   335M     1024         ~2GB     None

Model-Specific Notes:

  • Both models load directly from HuggingFace Hub
  • All models use sentence-transformers library for consistent interface
  • Models are lazy-loaded and cached for performance

Data Flow

User Input (reference word)
    ↓
Model Selection (radio buttons)
    ↓
Lazy Load Model + Collection (if not cached)
    ↓
Encode Reference Word
    ↓
Vector Similarity Search (ChromaDB or numpy)
    ↓
Format Results with Model Label
    ↓
Display in Gradio Interface
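
As a toy illustration of the flow above (illustrative names; 3-dimensional toy vectors and an in-memory dict stand in for the real SentenceTransformer encodings and the ChromaDB collection):

```python
import numpy as np

# Toy stand-ins: the deployed app encodes with SentenceTransformer
# and queries ChromaDB instead of this dict
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.9, 0.4]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def explore(reference):
    ref = embeddings[reference]                      # "Encode Reference Word"
    results = [(w, cosine(ref, v))
               for w, v in embeddings.items() if w != reference]
    results.sort(key=lambda t: t[1], reverse=True)   # "Vector Similarity Search"
    return [f"[toy-model] {w}: {s:.3f}" for w, s in results]  # "Format Results"
```

Swapping the dict lookup for a real `model.encode(reference)` call and the list scan for a ChromaDB query yields the production path.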

File Structure

/home/jnalv/llm-projects/semantic-explorer/
├── app.py                      # Main Gradio interface (MOST IMPORTANT)
├── prepare_embeddings.py       # One-time embedding generation
├── requirements.txt            # Python dependencies
├── README.md                   # User-facing documentation
├── QUICKSTART.md               # Getting started guide
├── DEPLOYMENT.md               # HuggingFace Spaces deployment
├── CHANGES.md                  # Changelog
├── documents/                  # Word lists (input data)
│   └── english-words.txt       # ~100k English words
├── models/                     # Shared embedding models (symlink to /home/jnalv/llm-projects/models)
├── chromadb/                   # Vector databases (one per model)
│   ├── words_all_MiniLM_L6_v2/
│   ├── words_bge_large_en_v1_5/
│   ├── words_jina_embeddings_v4/
│   ├── words_Qwen3_Embedding_0_6B/
│   ├── words_embeddinggemma_300m/
│   └── words_nomic_embed_text_v1_5/
└── [documentation files]

Key Implementation Details

Model Registry Pattern

Models are defined in a central registry (AVAILABLE_MODELS dict in app.py):

AVAILABLE_MODELS = {
    "model-key": {
        "name": "model-key",
        "display": "Model Name: X dimensions",  # Shown in model selector
        "hf_id": "org/model-name",              # HuggingFace model ID
        "dimensions": 1024,                      # Embedding dimensions
        "trust_remote_code": False,             # Security flag
        "default_task": "retrieval"             # Optional: for task-specific models
    }
}

Adding a new model requires:

  1. Entry in AVAILABLE_MODELS (app.py)
  2. Entry in MODELS_REQUIRING_TRUST if needed (prepare_embeddings.py)
  3. Entry in MODELS_REQUIRING_TASK if needed (prepare_embeddings.py)
  4. Running prepare_embeddings.py to create its collection

Collection Naming Convention

Collections are auto-generated from model keys:

  • Pattern: words_{model_key} with special chars replaced by underscores
  • Example: jinaai/jina-embeddings-v4 → words_jina_embeddings_v4
  • Implementation: get_collection_name() function
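
A minimal sketch of the convention, assuming it keeps the segment after the last `/` and maps every non-alphanumeric character to an underscore (consistent with the examples above; the deployed `get_collection_name()` may differ in detail):

```python
import re

def get_collection_name(model_key: str) -> str:
    """Derive a ChromaDB collection name from a model key (sketch)."""
    base = model_key.rsplit("/", 1)[-1]  # drop any org prefix
    return "words_" + re.sub(r"[^A-Za-z0-9]", "_", base)

# "jinaai/jina-embeddings-v4" -> "words_jina_embeddings_v4"
# "BAAI/bge-large-en-v1.5"    -> "words_bge_large_en_v1_5"
```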

Lazy Loading Strategy

Models and collections are only loaded when first selected:

loaded_models = {}       # Cache: model_key → SentenceTransformer
loaded_collections = {}  # Cache: model_key → ChromaDB Collection

def load_model_and_collection(model_key):
    if model_key in loaded_models:
        return loaded_models[model_key], loaded_collections[model_key]
    # Cache miss: load once, cache, and reuse on every later call
    info = AVAILABLE_MODELS[model_key]
    loaded_models[model_key] = SentenceTransformer(info["hf_id"])
    loaded_collections[model_key] = chroma_client.get_collection(
        get_collection_name(model_key)  # chroma_client created at startup
    )
    return loaded_models[model_key], loaded_collections[model_key]

Benefits:

  • Fast app startup (no models loaded initially)
  • Memory efficient (only load what's used)
  • Smooth model switching (cached after first load)

Similarity Computation

Most/Least Similar:

  • Retrieves all embeddings from collection
  • Computes cosine similarity with numpy
  • Sorts and returns top n / bottom n
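
Sketched with numpy (hypothetical function name; in the app, `vectors` and `words` are pulled from the ChromaDB collection first):

```python
import numpy as np

def most_and_least_similar(ref_vec, vectors, words, n=20):
    """Return the n most and n least cosine-similar words to ref_vec."""
    vecs = np.asarray(vectors, dtype=float)
    ref = np.asarray(ref_vec, dtype=float)
    # Cosine similarity of the reference against every stored embedding
    sims = vecs @ ref / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(ref))
    order = np.argsort(sims)[::-1]  # indices sorted by descending similarity
    most = [(words[i], float(sims[i])) for i in order[:n]]
    least = [(words[i], float(sims[i])) for i in order[-n:]]
    return most, least
```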

Target Score (tab removed from the current UI; described for reference):

  • Finds words within tolerance of target score
  • If not enough exact matches, shows "next-nearest"
  • Random sampling if too many exact matches

Comparison Tool:

  • Encodes reference + comparison words in single batch
  • Computes pairwise similarities
  • Sorts by similarity (descending)
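
The Comparison Tool step, sketched with numpy (hypothetical helper; the real code would obtain `vecs` from a single `model.encode([reference] + comparisons)` batch):

```python
import numpy as np

def rank_comparisons(vecs, words):
    """Rank comparison words by cosine similarity to the reference.

    vecs[0] is the reference embedding; vecs[1:] correspond to words.
    """
    vecs = np.asarray(vecs, dtype=float)
    ref, rest = vecs[0], vecs[1:]
    sims = rest @ ref / (np.linalg.norm(rest, axis=1) * np.linalg.norm(ref))
    # Sort (word, similarity) pairs by similarity, descending
    return sorted(zip(words, sims.tolist()), key=lambda t: t[1], reverse=True)
```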

Critical Code Patterns

Trust Remote Code Handling

Some models require custom code execution:

# In model registry
"trust_remote_code": True

# In loading code
model = SentenceTransformer(
    model_id,
    trust_remote_code=model_info.get("trust_remote_code", False)
)

Models requiring this: Jina v4, Nomic v1.5

Task-Specific Model Handling

Jina v4 uses LoRA adapters and requires a task:

# In model registry
"default_task": "retrieval"

# In loading code
model_kwargs = {}
if "default_task" in model_info:
    model_kwargs["default_task"] = model_info["default_task"]

model = SentenceTransformer(
    model_id,
    model_kwargs=model_kwargs if model_kwargs else None
)

Gradio Tab Structure

Interface uses tabs for different modes:

with gr.Blocks() as app:
    model_selector = gr.Radio(choices=[...])  # Top-level selector with radio buttons
    
    with gr.Tabs():
        with gr.Tab("Comparison Tool"):      # Tab 1
            # ... comparison UI (up to 10 words)
        with gr.Tab("Most & Least Similar"): # Tab 2
            # ... shows 20 most + 20 least similar words

Design rationale:

  • Comparison Tool first (most direct use case)
  • Combined Most & Least (better for mobile, shows contrast)
  • Target Score removed (didn't function as needed)
  • Radio buttons for model selection (cleaner with only 2 models)
  • Fixed 20 results in Most & Least (no slider needed)

Development History

Model Evolution

Removed Models:

  • E5 Large v2 → "clumpy" semantic clustering
  • Qwen2.5 Embedding 7B → OOM errors (14GB VRAM)
  • Jina Embeddings v3 → Superseded by v4

Current Models (2 deployed on HF Spaces):

  • all-MiniLM-L6-v2: Fast, lightweight (384 dimensions)
  • BGE Large EN v1.5: Larger, more capable (1024 dimensions)
  • Both models chosen for quality and HF Spaces compatibility

Interface Evolution

Original Design:

  • 5 separate tabs (Most Similar, Least Similar, Target Score, Target Range, Comparison)
  • Model name shown at bottom
  • Dropdown for model selection

Previous Design (6 models):

  • 3 tabs (Comparison, Most & Least, Target Score)
  • Model dropdown at top
  • Slider for result count

Current Design (Production on HF Spaces):

  • 2 tabs (Comparison, Most & Least Similar)
  • Radio buttons for model selection
  • Fixed 20 results in Most & Least Similar
  • Model name shown in each result
  • Streamlined, mobile-optimized interface

Deployment Status (HuggingFace Spaces)

Current Deployment ✅

Live on HuggingFace Spaces:

  • URL: hf.co/spaces/jnalv/semantic-explorer
  • 2 models deployed: all-MiniLM-L6-v2, BGE Large EN v1.5
  • ChromaDB databases pre-computed and uploaded
  • Models load directly from HuggingFace Hub (no local paths)
  • Lazy loading implemented for optimal cold-start performance

Deployment Configuration:

  1. app.py - Main application file (HF Spaces runs this automatically)
  2. Models selected for HF Spaces storage limits
  3. Embeddings pre-computed locally and uploaded
  4. ChromaDB databases included in repository

README.md metadata (in production):

title: Semantic Word Explorer
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false

Deployment Notes

Storage:

  • ChromaDB databases (~1GB each for 2 models)
  • Managed within HF Spaces storage limits

Performance:

  • Models lazy-load on first selection
  • Cached after first load for fast switching
  • Cold start: ~30-60 seconds per model (acceptable for free tier)

Resource Management:

  • 2 models fit comfortably in HF Spaces memory
  • Lazy loading ensures efficient resource usage


Developer Notes

Josh's Preferences

Code style:

  • Practical over theoretical
  • Working prototypes over perfect architecture
  • Remove features that don't add clear value
  • Streamlined, mobile-friendly interfaces
  • No excessive explanation in UI

Development approach:

  • Iterative refinement based on testing
  • Build complete prototypes to understand concepts
  • Modular structure (separate projects, shared resources)
  • Git hygiene with virtual environments

Testing Workflow

# Local testing
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer

# Prepare embeddings for a model
python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model BAAI/bge-large-en-v1.5

# Launch app
python app.py

# Access at http://localhost:7860

Common Issues & Solutions

Issue: Model not loading
Solution: Check that prepare_embeddings.py was run for that model

Issue: Collection not found
Solution: Collection names auto-generated; verify with ChromaDB

Issue: OOM during embedding generation
Solution: Reduce --batch-size parameter

Issue: Jina v4 task error
Solution: Ensure default_task is set in model registry


AI Agent Instructions

When working with this codebase:

  1. Adding new models: Update AVAILABLE_MODELS in app.py, add trust/task requirements if needed
  2. Modifying UI: Edit Gradio components in create_interface() function
  3. Changing similarity logic: Modify functions in app.py (find_most_and_least_similar, etc.)
  4. Deployment prep: app.py is the single deployed file (app_hf_spaces.py has been removed); keep changes compatible with the Spaces environment
  5. Documentation: Update relevant .md files when making significant changes

Code Modification Guidelines

  • Preserve the lazy loading pattern (don't load all models at startup)
  • Maintain the model registry structure (central source of truth)
  • Keep model labels in results (user feedback requirement)
  • Test with multiple models before committing
  • Update requirements.txt for new dependencies

Files You'll Most Often Edit

  1. app.py - Main application logic (deployed to HF Spaces)
  2. prepare_embeddings.py - Embedding generation (local only, for adding new models)
  3. requirements.txt - Dependencies
  4. README.md / QUICKSTART.md - User documentation

Important Files

Production:

  • app.py - Single app file (loads models from HuggingFace Hub)
  • chromadb/ - Pre-computed embeddings for deployed models
  • requirements.txt - Dependencies for HF Spaces

Local Development:

  • prepare_embeddings.py - Generate embeddings locally before deployment
  • documents/english-words.txt - Word list (~100k words)

Files You Shouldn't Usually Edit

  • chromadb/ - Generated data
  • documents/ - Input data
  • models/ - Downloaded models
  • Backup files (*.backup)

Quick Reference

Run Locally

source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer
python app.py

Prepare New Model

python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model <model-hf-id>

Check Available Models

# Look at AVAILABLE_MODELS in app.py
# or run:
python -c "import app; print(list(app.AVAILABLE_MODELS.keys()))"

Test Specific Model

# Launch app, select a model via the radio buttons, test each tab

Related Documentation

  • QUICKSTART.md - Getting started guide
  • README.md - Comprehensive user documentation
  • DEPLOYMENT.md - HuggingFace Spaces deployment (original)
  • HUGGINGFACE_SPACES_WORKFLOW.md - Git workflow for Spaces
  • CHANGES.md - Change history
  • NEW_MODELS.md - Latest model additions
  • JINA_V4_TASK_FIX.md - Jina v4 task requirement fix

Version Information

Current State: Live on HuggingFace Spaces
Deployment Date: January 2025
Last Major Update: January 2025 - Streamlined to 2 tabs, radio buttons, 2 models
Python Version: 3.12
Gradio Version: 4.0.0+
Deployed Models: 2 embedding models (384 and 1024 dimensions)


END OF PROJECT DOCUMENTATION