# Semantic Explorer - Project Documentation

**Project Type:** Gradio Web Application
**Target Platform:** Hugging Face Spaces (Gradio SDK)
**Development Environment:** Ubuntu Linux with Cursor IDE
**Python Version:** 3.12
**Primary Developer:** Josh (jnalv)
## Project Overview

### What This App Does

Semantic Explorer is an interactive web application that allows users to explore semantic relationships between words using multiple embedding models. Users can:
- **Compare embeddings across models** - Select from 2 different embedding models to see how they represent semantic meaning
- **Find similar and dissimilar words** - Discover the 20 most and 20 least semantically similar words to a reference word
- **Compare words directly** - Get similarity scores between a reference word and up to 10 comparison words
### Core Functionality

The app operates in two phases:

**Phase 1: Preparation (local, one-time per model)**
- User runs `prepare_embeddings.py` with a word list and chosen embedding model
- Script generates embeddings for all words (typically 100k+ words)
- Embeddings are stored in a ChromaDB vector database
- Each model gets its own collection (e.g., `words_bge_large_en_v1_5`)
**Phase 2: Exploration (web interface)**
- User launches the Gradio app via `app.py` (deployed on HuggingFace Spaces)
- Selects an embedding model via radio buttons
- Models and collections lazy-load on demand (cached after first load)
- Performs similarity searches and comparisons in real-time
### Primary Use Case
Educational and research tool for understanding how different embedding models capture semantic meaning. Allows side-by-side comparison of model behaviors when representing language.
## Project Goals

### Immediate Goals

**Hugging Face Spaces Deployment** ✅ COMPLETE
- App deployed and live on HuggingFace Spaces
- 2 embedding models integrated and tested (all-MiniLM-L6-v2, BGE Large EN v1.5)
- Models load directly from HuggingFace Hub
- ChromaDB databases pre-computed and uploaded
- Lazy loading and caching implemented for optimal performance
**User Experience Polish** ✅ COMPLETE
- Streamlined 2-tab interface (Comparison Tool, Most & Least Similar)
- Radio buttons for model selection
- Fixed 20 results in Most & Least Similar (no slider)
- Clear model labels in all results
- Mobile-responsive design
### Long-Term Goals
- Support for custom word lists via file upload
- Visualization of embedding spaces (t-SNE/UMAP)
- Export similarity results as CSV
- Multi-language support beyond English
- API endpoint for programmatic access
## Technical Architecture

### Technology Stack

**Core Dependencies:**

```text
gradio>=4.0.0                  # Web interface framework
chromadb>=0.4.0                # Vector database
sentence-transformers>=2.2.0   # Embedding model interface
numpy>=1.24.0                  # Numerical operations
torch>=2.0.0                   # PyTorch backend
einops>=0.6.0                  # Tensor operations (Jina v4)
peft>=0.7.0                    # LoRA adapters (Jina v4)
tqdm>=4.65.0                   # Progress bars
```
**Hardware Requirements:**
- GPU: NVIDIA RTX 4060Ti 16GB VRAM (development system)
- RAM: 32GB DDR4
- Storage: ~500MB per model + ~1GB per embedded word list
### Embedding Models (Currently Deployed)
| Model | Params | Dimensions | VRAM | Special Requirements |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | 384 | ~0.5GB | None |
| BGE Large EN v1.5 | 335M | 1024 | ~2GB | None |
**Model-Specific Notes:**
- Both models load directly from HuggingFace Hub
- All models use sentence-transformers library for consistent interface
- Models are lazy-loaded and cached for performance
### Data Flow

```text
User Input (reference word)
    ↓
Model Selection (radio buttons)
    ↓
Lazy Load Model + Collection (if not cached)
    ↓
Encode Reference Word
    ↓
Vector Similarity Search (ChromaDB or numpy)
    ↓
Format Results with Model Label
    ↓
Display in Gradio Interface
```
### File Structure

```text
/home/jnalv/llm-projects/semantic-explorer/
├── app.py                    # Main Gradio interface (MOST IMPORTANT)
├── prepare_embeddings.py     # One-time embedding generation
├── requirements.txt          # Python dependencies
├── README.md                 # User-facing documentation
├── QUICKSTART.md             # Getting started guide
├── DEPLOYMENT.md             # HuggingFace Spaces deployment
├── CHANGES.md                # Changelog
├── documents/                # Word lists (input data)
│   └── english-words.txt     # ~100k English words
├── models/                   # Shared embedding models (symlink to /home/jnalv/llm-projects/models)
├── chromadb/                 # Vector databases (one per model)
│   ├── words_all_MiniLM_L6_v2/
│   ├── words_bge_large_en_v1_5/
│   ├── words_jina_embeddings_v4/
│   ├── words_Qwen3_Embedding_0_6B/
│   ├── words_embeddinggemma_300m/
│   └── words_nomic_embed_text_v1_5/
└── [documentation files]
```
## Key Implementation Details

### Model Registry Pattern

Models are defined in a central registry (the `AVAILABLE_MODELS` dict in app.py):

```python
AVAILABLE_MODELS = {
    "model-key": {
        "name": "model-key",
        "display": "Model Name: X dimensions",  # Shown in the model selector
        "hf_id": "org/model-name",              # HuggingFace model ID
        "dimensions": 1024,                     # Embedding dimensions
        "trust_remote_code": False,             # Security flag
        "default_task": "retrieval",            # Optional: for task-specific models
    }
}
```
Adding a new model requires:
- Entry in `AVAILABLE_MODELS` (app.py)
- Entry in `MODELS_REQUIRING_TRUST` if needed (prepare_embeddings.py)
- Entry in `MODELS_REQUIRING_TASK` if needed (prepare_embeddings.py)
- Running `prepare_embeddings.py` to create its collection
### Collection Naming Convention

Collections are auto-generated from model keys:
- Pattern: `words_{model_key}` with special characters replaced by underscores
- Example: `jinaai/jina-embeddings-v4` → `words_jina_embeddings_v4`
- Implementation: the `get_collection_name()` function
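Following the convention above, a minimal sketch of what `get_collection_name()` might look like (the actual implementation in app.py may differ; this reconstruction only matches the documented pattern and examples):

```python
import re

def get_collection_name(model_key: str) -> str:
    """Hypothetical reconstruction: drop any org prefix, then replace
    special characters with underscores, per the documented pattern."""
    base = model_key.split("/")[-1]  # "jinaai/jina-embeddings-v4" -> "jina-embeddings-v4"
    return "words_" + re.sub(r"[^A-Za-z0-9]", "_", base)
```

For example, `get_collection_name("BAAI/bge-large-en-v1.5")` yields `words_bge_large_en_v1_5`, matching the collection directories listed in the file structure.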
### Lazy Loading Strategy

Models and collections are only loaded when first selected:

```python
loaded_models = {}       # Cache: model_key → SentenceTransformer
loaded_collections = {}  # Cache: model_key → ChromaDB Collection

def load_model_and_collection(model_key):
    if model_key in loaded_models:
        return loaded_models[model_key], loaded_collections[model_key]
    # ... load and cache
```
**Benefits:**
- Fast app startup (no models loaded initially)
- Memory efficient (only load what's used)
- Smooth model switching (cached after first load)
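The cache-then-load pattern can be sketched generically. This is a hedged illustration, not the exact code in app.py: the `loader` callback stands in for the real SentenceTransformer construction and ChromaDB collection lookup, and is injected here so the caching logic is visible on its own:

```python
loaded_models = {}       # Cache: model_key -> model object
loaded_collections = {}  # Cache: model_key -> collection object

def load_model_and_collection(model_key, loader):
    """loader(model_key) -> (model, collection); invoked at most once per key.
    In app.py the loader role is played by the actual model/DB setup;
    a callback is used here purely for illustration."""
    if model_key not in loaded_models:
        model, collection = loader(model_key)
        loaded_models[model_key] = model
        loaded_collections[model_key] = collection
    return loaded_models[model_key], loaded_collections[model_key]
```

The key property is that the expensive `loader` call happens only on the first selection of a given model; every later selection is a dictionary lookup.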
### Similarity Computation

**Most/Least Similar:**
- Retrieves all embeddings from the collection
- Computes cosine similarity with numpy
- Sorts and returns the top n / bottom n

**Target Score** (legacy; removed from the current UI):
- Finds words within tolerance of a target score
- If there are not enough exact matches, shows "next-nearest" results
- Random sampling if there are too many exact matches

**Comparison Tool:**
- Encodes reference + comparison words in a single batch
- Computes pairwise similarities
- Sorts by similarity (descending)
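The most/least-similar path described above can be sketched with plain numpy. Function and variable names here are illustrative, not necessarily those used in app.py:

```python
import numpy as np

def most_and_least_similar(ref_vec, embeddings, words, n=20):
    """Cosine similarity of one reference vector against every stored
    embedding; returns the top-n and bottom-n (word, score) pairs."""
    emb = np.asarray(embeddings, dtype=float)   # shape: (num_words, dims)
    ref = np.asarray(ref_vec, dtype=float)      # shape: (dims,)
    sims = emb @ ref / (np.linalg.norm(emb, axis=1) * np.linalg.norm(ref))
    order = np.argsort(sims)                    # ascending by similarity
    most = [(words[i], float(sims[i])) for i in order[::-1][:n]]
    least = [(words[i], float(sims[i])) for i in order[:n]]
    return most, least
```

With ~100k stored vectors this is a single matrix-vector product, which is why the numpy path is fast enough to run per request.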
## Critical Code Patterns

### Trust Remote Code Handling

Some models require custom code execution:

```python
# In model registry
"trust_remote_code": True

# In loading code
model = SentenceTransformer(
    model_id,
    trust_remote_code=model_info.get("trust_remote_code", False)
)
```

Models requiring this: Jina v4, Nomic v1.5
### Task-Specific Model Handling

Jina v4 uses LoRA adapters and requires a task:

```python
# In model registry
"default_task": "retrieval"

# In loading code
model_kwargs = {}
if "default_task" in model_info:
    model_kwargs["default_task"] = model_info["default_task"]
model = SentenceTransformer(
    model_id,
    model_kwargs=model_kwargs if model_kwargs else None
)
```
### Gradio Tab Structure

Interface uses tabs for different modes:

```python
with gr.Blocks() as app:
    model_selector = gr.Radio(choices=[...])  # Top-level selector with radio buttons
    with gr.Tabs():
        with gr.Tab("Comparison Tool"):       # Tab 1: comparison UI (up to 10 words)
            ...
        with gr.Tab("Most & Least Similar"):  # Tab 2: 20 most + 20 least similar words
            ...
```
**Design rationale:**
- Comparison Tool first (most direct use case)
- Combined Most & Least (better for mobile, shows contrast)
- Target Score removed (didn't function as needed)
- Radio buttons for model selection (cleaner with only 2 models)
- Fixed 20 results in Most & Least (no slider needed)
## Development History

### Model Evolution

**Removed Models:**
- E5 Large v2 → "clumpy" semantic clustering
- Qwen2.5 Embedding 7B → OOM errors (14GB VRAM)
- Jina Embeddings v3 → superseded by v4

**Current Models (2 deployed on HF Spaces):**
- all-MiniLM-L6-v2: fast, lightweight (384 dimensions)
- BGE Large EN v1.5: larger, more capable (1024 dimensions)
- Both chosen for quality and HF Spaces compatibility
### Interface Evolution

**Original Design:**
- 5 separate tabs (Most Similar, Least Similar, Target Score, Target Range, Comparison)
- Model name shown at bottom
- Dropdown for model selection

**Previous Design (6 models):**
- 3 tabs (Comparison, Most & Least, Target Score)
- Model dropdown at top
- Slider for result count

**Current Design (Production on HF Spaces):**
- 2 tabs (Comparison, Most & Least Similar)
- Radio buttons for model selection
- Fixed 20 results in Most & Least Similar
- Model name shown in each result
- Streamlined, mobile-optimized interface
## Deployment Status (HuggingFace Spaces)

### Current Deployment ✅

**Live on HuggingFace Spaces:**
- URL: hf.co/spaces/jnalv/semantic-explorer
- 2 models deployed: all-MiniLM-L6-v2, BGE Large EN v1.5
- ChromaDB databases pre-computed and uploaded
- Models load directly from HuggingFace Hub (no local paths)
- Lazy loading implemented for optimal cold-start performance
**Deployment Configuration:**
- `app.py` - Main application file (HF Spaces runs this automatically)
- Models selected for HF Spaces storage limits
- Embeddings pre-computed locally and uploaded
- ChromaDB databases included in repository
**README.md metadata (in production):**

```yaml
title: Semantic Word Explorer
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
```
### Deployment Notes
**Storage:**
- ChromaDB databases (~1GB each for 2 models)
- Managed within HF Spaces storage limits
**Performance:**
- Models lazy-load on first selection
- Cached after first load for fast switching
- Cold start: ~30-60 seconds per model (acceptable for free tier)
**Resource Management:**
- 2 models fit comfortably in HF Spaces memory
- Lazy loading ensures efficient resource usage
---
## Developer Notes
### Josh's Preferences
**Code style:**
- Practical over theoretical
- Working prototypes over perfect architecture
- Remove features that don't add clear value
- Streamlined, mobile-friendly interfaces
- No excessive explanation in UI
**Development approach:**
- Iterative refinement based on testing
- Build complete prototypes to understand concepts
- Modular structure (separate projects, shared resources)
- Git hygiene with virtual environments
### Testing Workflow
```bash
# Local testing
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer

# Prepare embeddings for a model
python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model BAAI/bge-large-en-v1.5

# Launch app
python app.py
# Access at http://localhost:7860
```
### Common Issues & Solutions

**Issue:** Model not loading
**Solution:** Check that `prepare_embeddings.py` was run for that model

**Issue:** Collection not found
**Solution:** Collection names are auto-generated; verify with ChromaDB

**Issue:** OOM during embedding generation
**Solution:** Reduce the `--batch-size` parameter

**Issue:** Jina v4 task error
**Solution:** Ensure `default_task` is set in the model registry
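The `--batch-size` mitigation works because embeddings are generated in fixed-size chunks, so peak memory scales with the batch size rather than the full ~100k-word list. A generic sketch of the idea; the `encode_fn` callback stands in for the real model encode call, and the function name is hypothetical (prepare_embeddings.py's internals may differ):

```python
def encode_in_batches(words, encode_fn, batch_size=256):
    """Encode words in chunks of batch_size; smaller batches bound
    peak memory during embedding generation."""
    vectors = []
    for start in range(0, len(words), batch_size):
        vectors.extend(encode_fn(words[start:start + batch_size]))
    return vectors
```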
## AI Agent Instructions

When working with this codebase:
- **Adding new models:** Update `AVAILABLE_MODELS` in app.py; add trust/task requirements if needed
- **Modifying UI:** Edit Gradio components in the `create_interface()` function
- **Changing similarity logic:** Modify functions in app.py (`find_most_and_least_similar`, etc.)
- **Deployment prep:** Focus on `app.py` and adapt for the Spaces environment
- **Documentation:** Update relevant .md files when making significant changes
### Code Modification Guidelines
- Preserve the lazy loading pattern (don't load all models at startup)
- Maintain the model registry structure (central source of truth)
- Keep model labels in results (user feedback requirement)
- Test with multiple models before committing
- Update requirements.txt for new dependencies
### Files You'll Most Often Edit

- `app.py` - Main application logic (deployed to HF Spaces)
- `prepare_embeddings.py` - Embedding generation (local only, for adding new models)
- `requirements.txt` - Dependencies
- `README.md` / `QUICKSTART.md` - User documentation
### Important Files

**Production:**
- `app.py` - Single app file (loads models from HuggingFace Hub)
- `chromadb/` - Pre-computed embeddings for deployed models
- `requirements.txt` - Dependencies for HF Spaces

**Local Development:**
- `prepare_embeddings.py` - Generate embeddings locally before deployment
- `documents/english-words.txt` - Word list (~100k words)
### Files You Shouldn't Usually Edit

- `chromadb/` - Generated data
- `documents/` - Input data
- `models/` - Downloaded models
- Backup files (`*.backup`)
## Quick Reference

### Run Locally

```bash
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer
python app.py
```
### Prepare New Model

```bash
python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model <model-hf-id>
```
### Check Available Models

```bash
# Look at AVAILABLE_MODELS in app.py, or run:
python -c "import app; print(list(app.AVAILABLE_MODELS.keys()))"
```
### Test Specific Model

Launch the app, select a model via the radio buttons, and test each tab.
## Related Documentation

- `QUICKSTART.md` - Getting started guide
- `README.md` - Comprehensive user documentation
- `DEPLOYMENT.md` - HuggingFace Spaces deployment (original)
- `HUGGINGFACE_SPACES_WORKFLOW.md` - Git workflow for Spaces
- `CHANGES.md` - Change history
- `NEW_MODELS.md` - Latest model additions
- `JINA_V4_TASK_FIX.md` - Jina v4 task requirement fix
## Version Information

**Current State:** Live on HuggingFace Spaces
**Deployment Date:** January 2025
**Last Major Update:** January 2025 - Streamlined to 2 tabs, radio buttons, 2 models
**Python Version:** 3.12
**Gradio Version:** 4.0.0+
**Deployed Models:** 2 embedding models (384 and 1024 dimensions)
END OF PROJECT DOCUMENTATION