
Semantic Explorer - Project Documentation

Project Type: Gradio Web Application
Target Platform: Hugging Face Spaces (Gradio SDK)
Development Environment: Ubuntu Linux with Cursor IDE
Python Version: 3.12
Primary Developer: Josh (jnalv)


Project Overview

What This App Does

Semantic Explorer is an interactive web application that allows users to explore semantic relationships between words using multiple embedding models. Users can:

  1. Compare embeddings across models - Select from 2 different embedding models to see how they represent semantic meaning
  2. Find similar and dissimilar words - Discover the 20 most and 20 least semantically similar words to a reference word
  3. Compare words directly - Get similarity scores between a reference word and up to 10 comparison words

Core Functionality

The app operates in two phases:

Phase 1: Preparation (Local, one-time per model)

  • User runs prepare_embeddings.py with a word list and chosen embedding model
  • Script generates embeddings for all words (typically 100k+ words)
  • Embeddings stored in ChromaDB vector database
  • Each model gets its own collection (e.g., words_bge_large_en_v1_5)

Phase 2: Exploration (Web interface)

  • User launches Gradio app via app.py (deployed on HuggingFace Spaces)
  • Selects embedding model from radio buttons
  • Models and collections lazy-load on demand (cached after first load)
  • Performs similarity searches and comparisons in real-time

Primary Use Case

Educational and research tool for understanding how different embedding models capture semantic meaning. Allows side-by-side comparison of model behaviors when representing language.


Project Goals

Immediate Goals

  1. Hugging Face Spaces Deployment ✅ COMPLETE

    • App deployed and live on HuggingFace Spaces
    • 2 embedding models integrated and tested (all-MiniLM-L6-v2, BGE Large EN v1.5)
    • Models load directly from HuggingFace Hub
    • ChromaDB databases pre-computed and uploaded
    • Lazy loading and caching implemented for optimal performance
  2. User Experience Polish ✅ COMPLETE

    • Streamlined 2-tab interface (Comparison Tool, Most & Least Similar)
    • Radio buttons for model selection
    • Fixed 20 results in Most & Least Similar (no slider)
    • Clear model labels in all results
    • Mobile-responsive design

Long-Term Goals

  • Support for custom word lists via file upload
  • Visualization of embedding spaces (t-SNE/UMAP)
  • Export similarity results as CSV
  • Multi-language support beyond English
  • API endpoint for programmatic access

Technical Architecture

Technology Stack

Core Dependencies:

gradio>=4.0.0              # Web interface framework
chromadb>=0.4.0            # Vector database
sentence-transformers>=2.2.0  # Embedding model interface
numpy>=1.24.0              # Numerical operations
torch>=2.0.0               # PyTorch backend
einops>=0.6.0              # Tensor operations (Jina v4)
peft>=0.7.0                # LoRA adapters (Jina v4)
tqdm>=4.65.0               # Progress bars

Hardware Requirements:

  • GPU: NVIDIA RTX 4060Ti 16GB VRAM (development system)
  • RAM: 32GB DDR4
  • Storage: ~500MB per model + ~1GB per embedded word list

Embedding Models (Currently Deployed)

Model               Params   Dimensions   VRAM     Special Requirements
all-MiniLM-L6-v2    22M      384          ~0.5GB   None
BGE Large EN v1.5   335M     1024         ~2GB     None

Model-Specific Notes:

  • Both models load directly from HuggingFace Hub
  • All models use sentence-transformers library for consistent interface
  • Models are lazy-loaded and cached for performance

Data Flow

User Input (reference word)
    ↓
Model Selection (radio buttons)
    ↓
Lazy Load Model + Collection (if not cached)
    ↓
Encode Reference Word
    ↓
Vector Similarity Search (ChromaDB or numpy)
    ↓
Format Results with Model Label
    ↓
Display in Gradio Interface
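
As a toy illustration of the flow above (illustrative names; 3-dimensional toy vectors and an in-memory dict stand in for the real SentenceTransformer encodings and the ChromaDB collection):

```python
import numpy as np

# Toy stand-ins: the deployed app encodes with SentenceTransformer
# and queries ChromaDB instead of this dict
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.9, 0.4]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def explore(reference):
    ref = embeddings[reference]                      # "Encode Reference Word"
    results = [(w, cosine(ref, v))
               for w, v in embeddings.items() if w != reference]
    results.sort(key=lambda t: t[1], reverse=True)   # "Vector Similarity Search"
    return [f"[toy-model] {w}: {s:.3f}" for w, s in results]  # "Format Results"
```

Swapping the dict lookup for a real `model.encode(reference)` call and the list scan for a ChromaDB query yields the production path.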

File Structure

/home/jnalv/llm-projects/semantic-explorer/
├── app.py                      # Main Gradio interface (MOST IMPORTANT)
├── prepare_embeddings.py       # One-time embedding generation
├── requirements.txt            # Python dependencies
├── README.md                   # User-facing documentation
├── QUICKSTART.md               # Getting started guide
├── DEPLOYMENT.md               # HuggingFace Spaces deployment
├── CHANGES.md                  # Changelog
├── documents/                  # Word lists (input data)
│   └── english-words.txt       # ~100k English words
├── models/                     # Shared embedding models (symlink to /home/jnalv/llm-projects/models)
├── chromadb/                   # Vector databases (one per model)
│   ├── words_all_MiniLM_L6_v2/
│   ├── words_bge_large_en_v1_5/
│   ├── words_jina_embeddings_v4/
│   ├── words_Qwen3_Embedding_0_6B/
│   ├── words_embeddinggemma_300m/
│   └── words_nomic_embed_text_v1_5/
└── [documentation files]

Key Implementation Details

Model Registry Pattern

Models are defined in a central registry (AVAILABLE_MODELS dict in app.py):

AVAILABLE_MODELS = {
    "model-key": {
        "name": "model-key",
        "display": "Model Name: X dimensions",  # Shown in model selector
        "hf_id": "org/model-name",              # HuggingFace model ID
        "dimensions": 1024,                      # Embedding dimensions
        "trust_remote_code": False,             # Security flag
        "default_task": "retrieval"             # Optional: for task-specific models
    }
}

Adding a new model requires:

  1. Entry in AVAILABLE_MODELS (app.py)
  2. Entry in MODELS_REQUIRING_TRUST if needed (prepare_embeddings.py)
  3. Entry in MODELS_REQUIRING_TASK if needed (prepare_embeddings.py)
  4. Running prepare_embeddings.py to create its collection

Collection Naming Convention

Collections are auto-generated from model keys:

  • Pattern: words_{model_key} with special chars replaced by underscores
  • Example: jinaai/jina-embeddings-v4 → words_jina_embeddings_v4
  • Implementation: get_collection_name() function
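
A minimal sketch of the convention, assuming it keeps the segment after the last `/` and maps every non-alphanumeric character to an underscore (consistent with the examples above; the deployed `get_collection_name()` may differ in detail):

```python
import re

def get_collection_name(model_key: str) -> str:
    """Derive a ChromaDB collection name from a model key (sketch)."""
    base = model_key.rsplit("/", 1)[-1]  # drop any org prefix
    return "words_" + re.sub(r"[^A-Za-z0-9]", "_", base)

# "jinaai/jina-embeddings-v4" -> "words_jina_embeddings_v4"
# "BAAI/bge-large-en-v1.5"    -> "words_bge_large_en_v1_5"
```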

Lazy Loading Strategy

Models and collections are only loaded when first selected:

loaded_models = {}       # Cache: model_key → SentenceTransformer
loaded_collections = {}  # Cache: model_key → ChromaDB Collection

def load_model_and_collection(model_key):
    if model_key in loaded_models:
        return loaded_models[model_key], loaded_collections[model_key]
    # Cache miss: load once, cache, and reuse on every later call
    info = AVAILABLE_MODELS[model_key]
    loaded_models[model_key] = SentenceTransformer(info["hf_id"])
    loaded_collections[model_key] = chroma_client.get_collection(
        get_collection_name(model_key)  # chroma_client created at startup
    )
    return loaded_models[model_key], loaded_collections[model_key]

Benefits:

  • Fast app startup (no models loaded initially)
  • Memory efficient (only load what's used)
  • Smooth model switching (cached after first load)

Similarity Computation

Most/Least Similar:

  • Retrieves all embeddings from collection
  • Computes cosine similarity with numpy
  • Sorts and returns top n / bottom n
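
Sketched with numpy (hypothetical function name; in the app, `vectors` and `words` are pulled from the ChromaDB collection first):

```python
import numpy as np

def most_and_least_similar(ref_vec, vectors, words, n=20):
    """Return the n most and n least cosine-similar words to ref_vec."""
    vecs = np.asarray(vectors, dtype=float)
    ref = np.asarray(ref_vec, dtype=float)
    # Cosine similarity of the reference against every stored embedding
    sims = vecs @ ref / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(ref))
    order = np.argsort(sims)[::-1]  # indices sorted by descending similarity
    most = [(words[i], float(sims[i])) for i in order[:n]]
    least = [(words[i], float(sims[i])) for i in order[-n:]]
    return most, least
```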

Target Score (tab removed from the current UI; described for reference):

  • Finds words within tolerance of target score
  • If not enough exact matches, shows "next-nearest"
  • Random sampling if too many exact matches

Comparison Tool:

  • Encodes reference + comparison words in single batch
  • Computes pairwise similarities
  • Sorts by similarity (descending)
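
The Comparison Tool step, sketched with numpy (hypothetical helper; the real code would obtain `vecs` from a single `model.encode([reference] + comparisons)` batch):

```python
import numpy as np

def rank_comparisons(vecs, words):
    """Rank comparison words by cosine similarity to the reference.

    vecs[0] is the reference embedding; vecs[1:] correspond to words.
    """
    vecs = np.asarray(vecs, dtype=float)
    ref, rest = vecs[0], vecs[1:]
    sims = rest @ ref / (np.linalg.norm(rest, axis=1) * np.linalg.norm(ref))
    # Sort (word, similarity) pairs by similarity, descending
    return sorted(zip(words, sims.tolist()), key=lambda t: t[1], reverse=True)
```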

Critical Code Patterns

Trust Remote Code Handling

Some models require custom code execution:

# In model registry
"trust_remote_code": True

# In loading code
model = SentenceTransformer(
    model_id,
    trust_remote_code=model_info.get("trust_remote_code", False)
)

Models requiring this: Jina v4, Nomic v1.5

Task-Specific Model Handling

Jina v4 uses LoRA adapters and requires a task:

# In model registry
"default_task": "retrieval"

# In loading code
model_kwargs = {}
if "default_task" in model_info:
    model_kwargs["default_task"] = model_info["default_task"]

model = SentenceTransformer(
    model_id,
    model_kwargs=model_kwargs if model_kwargs else None
)

Gradio Tab Structure

Interface uses tabs for different modes:

with gr.Blocks() as app:
    model_selector = gr.Radio(choices=[...])  # Top-level selector with radio buttons
    
    with gr.Tabs():
        with gr.Tab("Comparison Tool"):      # Tab 1
            # ... comparison UI (up to 10 words)
        with gr.Tab("Most & Least Similar"): # Tab 2
            # ... shows 20 most + 20 least similar words

Design rationale:

  • Comparison Tool first (most direct use case)
  • Combined Most & Least (better for mobile, shows contrast)
  • Target Score removed (didn't function as needed)
  • Radio buttons for model selection (cleaner with only 2 models)
  • Fixed 20 results in Most & Least (no slider needed)

Development History

Model Evolution

Removed Models:

  • E5 Large v2 → "clumpy" semantic clustering
  • Qwen2.5 Embedding 7B → OOM errors (14GB VRAM)
  • Jina Embeddings v3 → Superseded by v4

Current Models (2 deployed on HF Spaces):

  • all-MiniLM-L6-v2: Fast, lightweight (384 dimensions)
  • BGE Large EN v1.5: Larger, more capable (1024 dimensions)
  • Both models chosen for quality and HF Spaces compatibility

Interface Evolution

Original Design:

  • 5 separate tabs (Most Similar, Least Similar, Target Score, Target Range, Comparison)
  • Model name shown at bottom
  • Dropdown for model selection

Previous Design (6 models):

  • 3 tabs (Comparison, Most & Least, Target Score)
  • Model dropdown at top
  • Slider for result count

Current Design (Production on HF Spaces):

  • 2 tabs (Comparison, Most & Least Similar)
  • Radio buttons for model selection
  • Fixed 20 results in Most & Least Similar
  • Model name shown in each result
  • Streamlined, mobile-optimized interface

Deployment Status (HuggingFace Spaces)

Current Deployment ✅

Live on HuggingFace Spaces:

  • URL: hf.co/spaces/jnalv/semantic-explorer
  • 2 models deployed: all-MiniLM-L6-v2, BGE Large EN v1.5
  • ChromaDB databases pre-computed and uploaded
  • Models load directly from HuggingFace Hub (no local paths)
  • Lazy loading implemented for optimal cold-start performance

Deployment Configuration:

  1. app.py - Main application file (HF Spaces runs this automatically)
  2. Models selected for HF Spaces storage limits
  3. Embeddings pre-computed locally and uploaded
  4. ChromaDB databases included in repository

README.md metadata (in production):

title: Semantic Word Explorer
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false

Deployment Notes

Storage:

  • ChromaDB databases (~1GB each for 2 models)
  • Managed within HF Spaces storage limits

Performance:

  • Models lazy-load on first selection
  • Cached after first load for fast switching
  • Cold start: ~30-60 seconds per model (acceptable for free tier)

Resource Management:

  • 2 models fit comfortably in HF Spaces memory
  • Lazy loading ensures efficient resource usage


Developer Notes

Josh's Preferences

Code style:

  • Practical over theoretical
  • Working prototypes over perfect architecture
  • Remove features that don't add clear value
  • Streamlined, mobile-friendly interfaces
  • No excessive explanation in UI

Development approach:

  • Iterative refinement based on testing
  • Build complete prototypes to understand concepts
  • Modular structure (separate projects, shared resources)
  • Git hygiene with virtual environments

Testing Workflow

# Local testing
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer

# Prepare embeddings for a model
python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model BAAI/bge-large-en-v1.5

# Launch app
python app.py

# Access at http://localhost:7860

Common Issues & Solutions

Issue: Model not loading
Solution: Check that prepare_embeddings.py was run for that model

Issue: Collection not found
Solution: Collection names auto-generated; verify with ChromaDB

Issue: OOM during embedding generation
Solution: Reduce --batch-size parameter

Issue: Jina v4 task error
Solution: Ensure default_task is set in model registry


AI Agent Instructions

When working with this codebase:

  1. Adding new models: Update AVAILABLE_MODELS in app.py, add trust/task requirements if needed
  2. Modifying UI: Edit Gradio components in create_interface() function
  3. Changing similarity logic: Modify functions in app.py (find_most_and_least_similar, etc.)
  4. Deployment prep: app.py is the single deployed file (app_hf_spaces.py has been removed); keep changes compatible with the Spaces environment
  5. Documentation: Update relevant .md files when making significant changes

Code Modification Guidelines

  • Preserve the lazy loading pattern (don't load all models at startup)
  • Maintain the model registry structure (central source of truth)
  • Keep model labels in results (user feedback requirement)
  • Test with multiple models before committing
  • Update requirements.txt for new dependencies

Files You'll Most Often Edit

  1. app.py - Main application logic (deployed to HF Spaces)
  2. prepare_embeddings.py - Embedding generation (local only, for adding new models)
  3. requirements.txt - Dependencies
  4. README.md / QUICKSTART.md - User documentation

Important Files

Production:

  • app.py - Single app file (loads models from HuggingFace Hub)
  • chromadb/ - Pre-computed embeddings for deployed models
  • requirements.txt - Dependencies for HF Spaces

Local Development:

  • prepare_embeddings.py - Generate embeddings locally before deployment
  • documents/english-words.txt - Word list (~100k words)

Files You Shouldn't Usually Edit

  • chromadb/ - Generated data
  • documents/ - Input data
  • models/ - Downloaded models
  • Backup files (*.backup)

Quick Reference

Run Locally

source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer
python app.py

Prepare New Model

python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model <model-hf-id>

Check Available Models

# Look at AVAILABLE_MODELS in app.py
# or run:
python -c "import app; print(list(app.AVAILABLE_MODELS.keys()))"

Test Specific Model

# Launch app, select a model via the radio buttons, test each tab

Related Documentation

  • QUICKSTART.md - Getting started guide
  • README.md - Comprehensive user documentation
  • DEPLOYMENT.md - HuggingFace Spaces deployment (original)
  • HUGGINGFACE_SPACES_WORKFLOW.md - Git workflow for Spaces
  • CHANGES.md - Change history
  • NEW_MODELS.md - Latest model additions
  • JINA_V4_TASK_FIX.md - Jina v4 task requirement fix

Version Information

Current State: Live on HuggingFace Spaces
Deployment Date: January 2025
Last Major Update: January 2025 - Streamlined to 2 tabs, radio buttons, 2 models
Python Version: 3.12
Gradio Version: 4.0.0+
Deployed Models: 2 embedding models (384 and 1024 dimensions)


END OF PROJECT DOCUMENTATION