# Semantic Explorer - Project Documentation

**Project Type:** Gradio Web Application
**Target Platform:** Hugging Face Spaces (Gradio SDK)
**Development Environment:** Ubuntu Linux with Cursor IDE
**Python Version:** 3.12
**Primary Developer:** Josh (jnalv)

---

## Project Overview

### What This App Does

Semantic Explorer is an interactive web application that lets users explore semantic relationships between words using multiple embedding models. Users can:

1. **Compare embeddings across models** - Select from 2 embedding models to see how each represents semantic meaning
2. **Find similar and dissimilar words** - Discover the 20 most and 20 least semantically similar words to a reference word
3. **Compare words directly** - Get similarity scores between a reference word and up to 10 comparison words

### Core Functionality

The app operates in two phases:

**Phase 1: Preparation (Local, one-time per model)**
- User runs `prepare_embeddings.py` with a word list and chosen embedding model
- Script generates embeddings for all words (typically 100k+ words)
- Embeddings are stored in a ChromaDB vector database
- Each model gets its own collection (e.g., `words_bge_large_en_v1_5`)

**Phase 2: Exploration (Web interface)**
- User launches the Gradio app via `app.py` (deployed on HuggingFace Spaces)
- Selects an embedding model via radio buttons
- Models and collections lazy-load on demand (cached after first load)
- Performs similarity searches and comparisons in real time

### Primary Use Case

Educational and research tool for understanding how different embedding models capture semantic meaning. Allows side-by-side comparison of model behaviors when representing language.

---

## Project Goals

### Immediate Goals

1. **Hugging Face Spaces Deployment** ✅ COMPLETE
   - App deployed and live on HuggingFace Spaces
   - 2 embedding models integrated and tested (all-MiniLM-L6-v2, BGE Large EN v1.5)
   - Models load directly from HuggingFace Hub
   - ChromaDB databases pre-computed and uploaded
   - Lazy loading and caching implemented for optimal performance

2. **User Experience Polish** ✅ COMPLETE
   - Streamlined 2-tab interface (Comparison Tool, Most & Least Similar)
   - Radio buttons for model selection
   - Fixed 20 results in Most & Least Similar (no slider)
   - Clear model labels in all results
   - Mobile-responsive design

### Long-Term Goals

- Support for custom word lists via file upload
- Visualization of embedding spaces (t-SNE/UMAP)
- Export similarity results as CSV
- Multi-language support beyond English
- API endpoint for programmatic access

---

## Technical Architecture

### Technology Stack

**Core Dependencies:**

```
gradio>=4.0.0                 # Web interface framework
chromadb>=0.4.0               # Vector database
sentence-transformers>=2.2.0  # Embedding model interface
numpy>=1.24.0                 # Numerical operations
torch>=2.0.0                  # PyTorch backend
einops>=0.6.0                 # Tensor operations (Jina v4)
peft>=0.7.0                   # LoRA adapters (Jina v4)
tqdm>=4.65.0                  # Progress bars
```

**Hardware Requirements:**
- GPU: NVIDIA RTX 4060 Ti 16GB VRAM (development system)
- RAM: 32GB DDR4
- Storage: ~500MB per model + ~1GB per embedded word list

### Embedding Models (Currently Deployed)

| Model | Params | Dimensions | VRAM | Special Requirements |
|-------|--------|------------|------|----------------------|
| all-MiniLM-L6-v2 | 22M | 384 | ~0.5GB | None |
| BGE Large EN v1.5 | 335M | 1024 | ~2GB | None |

**Model-Specific Notes:**
- Both models load directly from HuggingFace Hub
- All models use the sentence-transformers library for a consistent interface
- Models are lazy-loaded and cached for performance

### Data Flow

```
User Input (reference word)
    ↓
Model Selection (radio buttons)
    ↓
Lazy Load Model + Collection (if not cached)
    ↓
Encode Reference Word
    ↓
Vector Similarity Search (ChromaDB or numpy)
    ↓
Format Results with Model Label
    ↓
Display in Gradio Interface
```

### File Structure

```
/home/jnalv/llm-projects/semantic-explorer/
├── app.py                    # Main Gradio interface (MOST IMPORTANT)
├── prepare_embeddings.py     # One-time embedding generation
├── requirements.txt          # Python dependencies
├── README.md                 # User-facing documentation
├── QUICKSTART.md             # Getting started guide
├── DEPLOYMENT.md             # HuggingFace Spaces deployment
├── CHANGES.md                # Changelog
├── documents/                # Word lists (input data)
│   └── english-words.txt     # ~100k English words
├── models/                   # Shared embedding models (symlink to /home/jnalv/llm-projects/models)
├── chromadb/                 # Vector databases (one per model)
│   ├── words_all_MiniLM_L6_v2/
│   ├── words_bge_large_en_v1_5/
│   ├── words_jina_embeddings_v4/
│   ├── words_Qwen3_Embedding_0_6B/
│   ├── words_embeddinggemma_300m/
│   └── words_nomic_embed_text_v1_5/
└── [documentation files]
```

---

## Key Implementation Details

### Model Registry Pattern

Models are defined in a central registry (the `AVAILABLE_MODELS` dict in `app.py`):

```python
AVAILABLE_MODELS = {
    "model-key": {
        "name": "model-key",
        "display": "Model Name: X dimensions",  # Shown in the model selector
        "hf_id": "org/model-name",              # HuggingFace model ID
        "dimensions": 1024,                     # Embedding dimensions
        "trust_remote_code": False,             # Security flag
        "default_task": "retrieval"             # Optional: for task-specific models
    }
}
```

**Adding a new model requires:**
1. Entry in `AVAILABLE_MODELS` (app.py)
2. Entry in `MODELS_REQUIRING_TRUST` if needed (prepare_embeddings.py)
3. Entry in `MODELS_REQUIRING_TASK` if needed (prepare_embeddings.py)
4. Running `prepare_embeddings.py` to create its collection

### Collection Naming Convention

Collections are auto-generated from model keys:
- Pattern: `words_{model_key}` with special chars replaced by underscores
- Example: `jinaai/jina-embeddings-v4` → `words_jina_embeddings_v4`
- Implementation: `get_collection_name()` function

### Lazy Loading Strategy

Models and collections are only loaded when first selected:

```python
loaded_models = {}       # Cache: model_key → SentenceTransformer
loaded_collections = {}  # Cache: model_key → ChromaDB Collection

def load_model_and_collection(model_key):
    if model_key in loaded_models:
        return loaded_models[model_key], loaded_collections[model_key]
    # ... load and cache
```

**Benefits:**
- Fast app startup (no models loaded initially)
- Memory efficient (only load what's used)
- Smooth model switching (cached after first load)

### Similarity Computation

**Most/Least Similar:**
- Retrieves all embeddings from the collection
- Computes cosine similarity with numpy
- Sorts and returns top n / bottom n

**Target Score** (legacy — the Target Score tab has been removed from the UI):
- Finds words within tolerance of a target score
- If not enough exact matches, shows "next-nearest"
- Random sampling if too many exact matches

**Comparison Tool:**
- Encodes reference + comparison words in a single batch
- Computes pairwise similarities
- Sorts by similarity (descending)

---

## Critical Code Patterns

### Trust Remote Code Handling

Some models require custom code execution:

```python
# In model registry
"trust_remote_code": True

# In loading code
model = SentenceTransformer(
    model_id,
    trust_remote_code=model_info.get("trust_remote_code", False)
)
```

**Models requiring this:** Jina v4, Nomic v1.5

### Task-Specific Model Handling

Jina v4 uses LoRA adapters and requires a task:

```python
# In model registry
"default_task": "retrieval"

# In loading code
model_kwargs = {}
if "default_task" in model_info:
    model_kwargs["default_task"] = model_info["default_task"]
model = SentenceTransformer(
    model_id,
    model_kwargs=model_kwargs if model_kwargs else None
)
```

### Gradio Tab Structure

The interface uses tabs for different modes:

```python
with gr.Blocks() as app:
    model_selector = gr.Radio(choices=[...])  # Top-level selector with radio buttons
    with gr.Tabs():
        with gr.Tab("Comparison Tool"):       # Tab 1
            ...  # comparison UI (up to 10 words)
        with gr.Tab("Most & Least Similar"):  # Tab 2
            ...  # shows 20 most + 20 least similar words
```

**Design rationale:**
- Comparison Tool first (most direct use case)
- Combined Most & Least (better for mobile, shows contrast)
- Target Score removed (didn't function as needed)
- Radio buttons for model selection (cleaner with only 2 models)
- Fixed 20 results in Most & Least (no slider needed)

---

## Development History

### Model Evolution

**Removed Models:**
- E5 Large v2 → "clumpy" semantic clustering
- Qwen2.5 Embedding 7B → OOM errors (14GB VRAM)
- Jina Embeddings v3 → Superseded by v4

**Current Models (2 deployed on HF Spaces):**
- all-MiniLM-L6-v2: Fast, lightweight (384 dimensions)
- BGE Large EN v1.5: Larger, more capable (1024 dimensions)
- Both models chosen for quality and HF Spaces compatibility

### Interface Evolution

**Original Design:**
- 5 separate tabs (Most Similar, Least Similar, Target Score, Target Range, Comparison)
- Model name shown at bottom
- Dropdown for model selection

**Previous Design (6 models):**
- 3 tabs (Comparison, Most & Least, Target Score)
- Model dropdown at top
- Slider for result count

**Current Design (Production on HF Spaces):**
- 2 tabs (Comparison, Most & Least Similar)
- Radio buttons for model selection
- Fixed 20 results in Most & Least Similar
- Model name shown in each result
- Streamlined, mobile-optimized interface

---

## Deployment Status (HuggingFace Spaces)

### Current Deployment ✅

**Live on HuggingFace Spaces:**
- URL: hf.co/spaces/jnalv/semantic-explorer
- 2 models deployed: all-MiniLM-L6-v2, BGE Large EN v1.5
- ChromaDB databases pre-computed and uploaded
- Models load directly from HuggingFace Hub (no local paths)
- Lazy loading implemented for optimal cold-start performance

**Deployment Configuration:**
1. `app.py` - Main application file (HF Spaces runs this automatically)
2. Models selected for HF Spaces storage limits
3. Embeddings pre-computed locally and uploaded
4. ChromaDB databases included in the repository

**README.md metadata (in production):**

```yaml
---
title: Semantic Word Explorer
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
---
```

### Deployment Notes

**Storage:**
- ChromaDB databases (~1GB each for 2 models)
- Managed within HF Spaces storage limits

**Performance:**
- Models lazy-load on first selection
- Cached after first load for fast switching
- Cold start: ~30-60 seconds per model (acceptable for free tier)

**Resource Management:**
- 2 models fit comfortably in HF Spaces memory
- Lazy loading ensures efficient resource usage

---

## Developer Notes

### Josh's Preferences

**Code style:**
- Practical over theoretical
- Working prototypes over perfect architecture
- Remove features that don't add clear value
- Streamlined, mobile-friendly interfaces
- No excessive explanation in the UI

**Development approach:**
- Iterative refinement based on testing
- Build complete prototypes to understand concepts
- Modular structure (separate projects, shared resources)
- Git hygiene with virtual environments

### Testing Workflow

```bash
# Local testing
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer

# Prepare embeddings for a model
python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model BAAI/bge-large-en-v1.5

# Launch app
python app.py
# Access at http://localhost:7860
```

### Common Issues & Solutions

**Issue:** Model not loading
**Solution:** Check that `prepare_embeddings.py` was run for that model

**Issue:** Collection not found
**Solution:** Collection names are auto-generated; verify with ChromaDB

**Issue:** OOM during embedding generation
**Solution:** Reduce the `--batch-size` parameter

**Issue:** Jina v4 task error
**Solution:** Ensure `default_task` is set in the model registry

---

## AI Agent Instructions

When working with this codebase:

1. **Adding new models:** Update `AVAILABLE_MODELS` in app.py; add trust/task requirements if needed
2. **Modifying UI:** Edit Gradio components in the `create_interface()` function
3. **Changing similarity logic:** Modify functions in app.py (`find_most_and_least_similar`, etc.)
4. **Deployment prep:** Focus on `app_hf_spaces.py` and adapt for the Spaces environment
5. **Documentation:** Update relevant .md files when making significant changes

### Code Modification Guidelines

- Preserve the lazy loading pattern (don't load all models at startup)
- Maintain the model registry structure (central source of truth)
- Keep model labels in results (user feedback requirement)
- Test with multiple models before committing
- Update requirements.txt for new dependencies

### Files You'll Most Often Edit

1. `app.py` - Main application logic (deployed to HF Spaces)
2. `prepare_embeddings.py` - Embedding generation (local only, for adding new models)
3. `requirements.txt` - Dependencies
4. `README.md` / `QUICKSTART.md` - User documentation

### Important Files

**Production:**
- `app.py` - Single app file (loads models from HuggingFace Hub)
- `chromadb/` - Pre-computed embeddings for deployed models
- `requirements.txt` - Dependencies for HF Spaces

**Local Development:**
- `prepare_embeddings.py` - Generate embeddings locally before deployment
- `documents/english-words.txt` - Word list (~100k words)

### Files You Shouldn't Usually Edit

- `chromadb/` - Generated data
- `documents/` - Input data
- `models/` - Downloaded models
- Backup files (*.backup)

---

## Quick Reference

### Run Locally

```bash
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer
python app.py
```

### Prepare New Model

```bash
python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model <hf-model-id>
```

### Check Available Models

```bash
# Look at AVAILABLE_MODELS in app.py, or run:
python -c "import app; print(list(app.AVAILABLE_MODELS.keys()))"
```

### Test a Specific Model

Launch the app, select a model via the radio buttons, and test each tab.

---

## Related Documentation

- `QUICKSTART.md` - Getting started guide
- `README.md` - Comprehensive user documentation
- `DEPLOYMENT.md` - HuggingFace Spaces deployment (original)
- `HUGGINGFACE_SPACES_WORKFLOW.md` - Git workflow for Spaces
- `CHANGES.md` - Change history
- `NEW_MODELS.md` - Latest model additions
- `JINA_V4_TASK_FIX.md` - Jina v4 task requirement fix

---

## Version Information

**Current State:** Live on HuggingFace Spaces
**Deployment Date:** January 2025
**Last Major Update:** January 2025 - Streamlined to 2 tabs, radio buttons, 2 models
**Python Version:** 3.12
**Gradio Version:** 4.0.0+
**Deployed Models:** 2 embedding models (384 and 1024 dimensions)

---

**END OF PROJECT DOCUMENTATION**
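
---

**Addendum — collection naming sketch.** The Collection Naming Convention section above describes `get_collection_name()` but does not show its body (the real implementation lives in `app.py`). A minimal sketch, assuming the sanitization implied by the documented examples (org prefix dropped, every non-alphanumeric character replaced with an underscore); the function name matches the doc, but the body here is an illustration, not the shipped code:

```python
import re

def get_collection_name(model_key: str) -> str:
    """Derive a ChromaDB collection name from a model key.

    Sketch based on the documented pattern `words_{model_key}`:
    drop any HuggingFace org prefix, then replace special
    characters with underscores.
    """
    base = model_key.split("/")[-1]  # e.g. "jinaai/jina-embeddings-v4" -> "jina-embeddings-v4"
    return "words_" + re.sub(r"[^A-Za-z0-9]", "_", base)

print(get_collection_name("jinaai/jina-embeddings-v4"))  # words_jina_embeddings_v4
print(get_collection_name("BAAI/bge-large-en-v1.5"))     # words_bge_large_en_v1_5
```

Both outputs match the collection directory names listed in the File Structure section.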
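
**Addendum — Most/Least Similar sketch.** The Similarity Computation section above says the Most & Least Similar tab computes cosine similarity against all stored embeddings with numpy, then sorts and returns the top n and bottom n. A self-contained sketch of that approach; the function name and signature are hypothetical (the app's actual function is `find_most_and_least_similar` in `app.py`):

```python
import numpy as np

def rank_most_and_least_similar(ref_vec, embeddings, words, n=20):
    """Rank every word by cosine similarity to a reference vector
    and return the n most and n least similar (word, score) pairs."""
    emb = np.asarray(embeddings, dtype=np.float64)
    ref = np.asarray(ref_vec, dtype=np.float64)
    # Cosine similarity: dot product divided by the product of norms
    sims = emb @ ref / (np.linalg.norm(emb, axis=1) * np.linalg.norm(ref))
    order = np.argsort(sims)[::-1]  # indices sorted by descending similarity
    most = [(words[i], float(sims[i])) for i in order[:n]]
    least = [(words[i], float(sims[i])) for i in order[-n:]]
    return most, least

# Toy example with 2-dimensional "embeddings"
words = ["same", "orthogonal", "opposite"]
vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
most, least = rank_most_and_least_similar([1.0, 0.0], vecs, words, n=1)
print(most)   # [('same', 1.0)]
print(least)  # [('opposite', -1.0)]
```

In the app the full embedding matrix comes from the model's ChromaDB collection, so this ranking is a single vectorized pass rather than a per-word loop.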