# Semantic Explorer - Project Documentation

**Project Type:** Gradio Web Application
**Target Platform:** Hugging Face Spaces (Gradio SDK)
**Development Environment:** Ubuntu Linux with Cursor IDE
**Python Version:** 3.12
**Primary Developer:** Josh (jnalv)

---
## Project Overview

### What This App Does

Semantic Explorer is an interactive web application that allows users to explore semantic relationships between words using multiple embedding models. Users can:

1. **Compare embeddings across models** - Select from 2 different embedding models to see how they represent semantic meaning
2. **Find similar and dissimilar words** - Discover the 20 most and 20 least semantically similar words to a reference word
3. **Compare words directly** - Get similarity scores between a reference word and up to 10 comparison words

### Core Functionality

The app operates in two phases:

**Phase 1: Preparation (local, one-time per model)**
- User runs `prepare_embeddings.py` with a word list and chosen embedding model
- Script generates embeddings for all words (typically 100k+ words)
- Embeddings are stored in a ChromaDB vector database
- Each model gets its own collection (e.g., `words_bge_large_en_v1_5`)

**Phase 2: Exploration (web interface)**
- User launches the Gradio app via `app.py` (deployed on HuggingFace Spaces)
- Selects an embedding model via radio buttons
- Models and collections lazy-load on demand (cached after first load)
- Performs similarity searches and comparisons in real time

### Primary Use Case

Educational and research tool for understanding how different embedding models capture semantic meaning. Allows side-by-side comparison of model behaviors when representing language.

---
## Project Goals

### Immediate Goals

1. **Hugging Face Spaces Deployment** ✅ COMPLETE
   - App deployed and live on HuggingFace Spaces
   - 2 embedding models integrated and tested (all-MiniLM-L6-v2, BGE Large EN v1.5)
   - Models load directly from HuggingFace Hub
   - ChromaDB databases pre-computed and uploaded
   - Lazy loading and caching implemented for optimal performance
2. **User Experience Polish** ✅ COMPLETE
   - Streamlined 2-tab interface (Comparison Tool, Most & Least Similar)
   - Radio buttons for model selection
   - Fixed 20 results in Most & Least Similar (no slider)
   - Clear model labels in all results
   - Mobile-responsive design

### Long-Term Goals

- Support for custom word lists via file upload
- Visualization of embedding spaces (t-SNE/UMAP)
- Export of similarity results as CSV
- Multi-language support beyond English
- API endpoint for programmatic access

---
## Technical Architecture

### Technology Stack

**Core Dependencies:**

```
gradio>=4.0.0                 # Web interface framework
chromadb>=0.4.0               # Vector database
sentence-transformers>=2.2.0  # Embedding model interface
numpy>=1.24.0                 # Numerical operations
torch>=2.0.0                  # PyTorch backend
einops>=0.6.0                 # Tensor operations (Jina v4)
peft>=0.7.0                   # LoRA adapters (Jina v4)
tqdm>=4.65.0                  # Progress bars
```

**Hardware Requirements (development system):**
- GPU: NVIDIA RTX 4060 Ti, 16GB VRAM
- RAM: 32GB DDR4
- Storage: ~500MB per model + ~1GB per embedded word list
### Embedding Models (Currently Deployed)

| Model | Params | Dimensions | VRAM | Special Requirements |
|-------|--------|------------|------|----------------------|
| all-MiniLM-L6-v2 | 22M | 384 | ~0.5GB | None |
| BGE Large EN v1.5 | 335M | 1024 | ~2GB | None |

**Model-Specific Notes:**
- Both models load directly from HuggingFace Hub
- All models use the sentence-transformers library for a consistent interface
- Models are lazy-loaded and cached for performance
| ``` | |
| User Input (reference word) | |
| β | |
| Model Selection (dropdown) | |
| β | |
| Lazy Load Model + Collection (if not cached) | |
| β | |
| Encode Reference Word | |
| β | |
| Vector Similarity Search (ChromaDB or numpy) | |
| β | |
| Format Results with Model Label | |
| β | |
| Display in Gradio Interface | |
| ``` | |
### File Structure

```
/home/jnalv/llm-projects/semantic-explorer/
├── app.py                  # Main Gradio interface (MOST IMPORTANT)
├── prepare_embeddings.py   # One-time embedding generation
├── requirements.txt        # Python dependencies
├── README.md               # User-facing documentation
├── QUICKSTART.md           # Getting started guide
├── DEPLOYMENT.md           # HuggingFace Spaces deployment
├── CHANGES.md              # Changelog
├── documents/              # Word lists (input data)
│   └── english-words.txt   # ~100k English words
├── models/                 # Shared embedding models (symlink to /home/jnalv/llm-projects/models)
├── chromadb/               # Vector databases (one per model)
│   ├── words_all_MiniLM_L6_v2/
│   ├── words_bge_large_en_v1_5/
│   ├── words_jina_embeddings_v4/
│   ├── words_Qwen3_Embedding_0_6B/
│   ├── words_embeddinggemma_300m/
│   └── words_nomic_embed_text_v1_5/
└── [documentation files]
```

---
| --- | |
| ## Key Implementation Details | |
| ### Model Registry Pattern | |
| Models are defined in a central registry (`AVAILABLE_MODELS` dict in `app.py`): | |
| ```python | |
| AVAILABLE_MODELS = { | |
| "model-key": { | |
| "name": "model-key", | |
| "display": "Model Name: X dimensions", # Shown in dropdown | |
| "hf_id": "org/model-name", # HuggingFace model ID | |
| "dimensions": 1024, # Embedding dimensions | |
| "trust_remote_code": False, # Security flag | |
| "default_task": "retrieval" # Optional: for task-specific models | |
| } | |
| } | |
| ``` | |
| **Adding a new model requires:** | |
| 1. Entry in `AVAILABLE_MODELS` (app.py) | |
| 2. Entry in `MODELS_REQUIRING_TRUST` if needed (prepare_embeddings.py) | |
| 3. Entry in `MODELS_REQUIRING_TASK` if needed (prepare_embeddings.py) | |
| 4. Running `prepare_embeddings.py` to create its collection | |
### Collection Naming Convention

Collection names are auto-generated from model keys:
- Pattern: `words_{model_key}` with special characters replaced by underscores
- Example: `jinaai/jina-embeddings-v4` → `words_jina_embeddings_v4`
- Implementation: the `get_collection_name()` function
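The convention above can be sketched as follows. This is an illustrative reimplementation consistent with the documented examples, not the actual code from `app.py`:

```python
import re

def get_collection_name(model_key: str) -> str:
    """Derive a ChromaDB collection name from a model key.

    Illustrative sketch of the documented convention: drop any
    HuggingFace org prefix, replace special characters with
    underscores, and prefix with "words_".
    """
    base = model_key.split("/")[-1]            # "jinaai/jina-embeddings-v4" -> "jina-embeddings-v4"
    safe = re.sub(r"[^A-Za-z0-9]", "_", base)  # dashes/dots -> underscores
    return f"words_{safe}"
```

This reproduces both documented examples: `words_jina_embeddings_v4` and `words_bge_large_en_v1_5`.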
### Lazy Loading Strategy

Models and collections are only loaded when first selected:

```python
loaded_models = {}       # Cache: model_key → SentenceTransformer
loaded_collections = {}  # Cache: model_key → ChromaDB collection

def load_model_and_collection(model_key):
    if model_key in loaded_models:
        return loaded_models[model_key], loaded_collections[model_key]
    # ... load and cache
```

**Benefits:**
- Fast app startup (no models loaded initially)
- Memory efficient (only load what's used)
- Smooth model switching (cached after first load)
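A fuller sketch of the load path, assuming `AVAILABLE_MODELS` and `get_collection_name()` are defined as described elsewhere in this document (the actual `app.py` may differ in details):

```python
loaded_models = {}       # Cache: model_key -> SentenceTransformer
loaded_collections = {}  # Cache: model_key -> ChromaDB collection

def load_model_and_collection(model_key):
    """Return the (model, collection) pair, loading on first use.

    Sketch only: AVAILABLE_MODELS and get_collection_name() are
    assumed to exist as shown in the sections above.
    """
    # Cache hit: no heavy imports, no disk or network access.
    if model_key in loaded_models:
        return loaded_models[model_key], loaded_collections[model_key]

    # Cache miss: import lazily so app startup stays fast.
    import chromadb
    from sentence_transformers import SentenceTransformer

    info = AVAILABLE_MODELS[model_key]
    model = SentenceTransformer(
        info["hf_id"],
        trust_remote_code=info.get("trust_remote_code", False),
    )
    client = chromadb.PersistentClient(path="chromadb")
    collection = client.get_collection(get_collection_name(model_key))

    loaded_models[model_key] = model
    loaded_collections[model_key] = collection
    return model, collection
```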
### Similarity Computation

**Most/Least Similar:**
- Retrieves all embeddings from the collection
- Computes cosine similarity with numpy
- Sorts and returns the top n / bottom n

**Target Score (legacy; this tab has been removed from the current UI):**
- Finds words within a tolerance of the target score
- If there are not enough exact matches, shows the next-nearest words
- Randomly samples if there are too many exact matches

**Comparison Tool:**
- Encodes the reference + comparison words in a single batch
- Computes pairwise similarities
- Sorts by similarity (descending)
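The Most/Least Similar path can be sketched in numpy. Function and variable names here are illustrative, not the actual `app.py` identifiers:

```python
import numpy as np

def rank_by_similarity(ref_vec, word_vecs, words, n=20):
    """Return the n most and n least cosine-similar (word, score) pairs.

    Illustrative sketch: normalize everything once, take dot products,
    then a single argsort serves both ends of the ranking.
    """
    ref = np.asarray(ref_vec, dtype=float)
    vecs = np.asarray(word_vecs, dtype=float)
    ref = ref / np.linalg.norm(ref)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ ref                       # cosine similarity per word
    order = np.argsort(sims)                # ascending by similarity
    least = [(words[i], float(sims[i])) for i in order[:n]]
    most = [(words[i], float(sims[i])) for i in order[::-1][:n]]
    return most, least
```

The single `argsort` is the reason the app can show both the 20 most and 20 least similar words from one pass over the collection.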
---
## Critical Code Patterns

### Trust Remote Code Handling

Some models require custom code execution:

```python
# In the model registry
"trust_remote_code": True

# In the loading code
model = SentenceTransformer(
    model_id,
    trust_remote_code=model_info.get("trust_remote_code", False)
)
```

**Models requiring this:** Jina v4, Nomic v1.5

### Task-Specific Model Handling

Jina v4 uses LoRA adapters and requires a task:

```python
# In the model registry
"default_task": "retrieval"

# In the loading code
model_kwargs = {}
if "default_task" in model_info:
    model_kwargs["default_task"] = model_info["default_task"]

model = SentenceTransformer(
    model_id,
    model_kwargs=model_kwargs if model_kwargs else None
)
```
### Gradio Tab Structure

The interface uses tabs for the different modes:

```python
with gr.Blocks() as app:
    model_selector = gr.Radio(choices=[...])  # Top-level model selector (radio buttons)
    with gr.Tabs():
        with gr.Tab("Comparison Tool"):       # Tab 1
            # ... comparison UI (up to 10 words)
        with gr.Tab("Most & Least Similar"):  # Tab 2
            # ... shows the 20 most + 20 least similar words
```

**Design rationale:**
- Comparison Tool first (most direct use case)
- Combined Most & Least (better for mobile, shows the contrast)
- Target Score removed (didn't function as needed)
- Radio buttons for model selection (cleaner with only 2 models)
- Fixed 20 results in Most & Least (no slider needed)

---
## Development History

### Model Evolution

**Removed Models:**
- E5 Large v2: "clumpy" semantic clustering
- Qwen2.5 Embedding 7B: OOM errors (14GB VRAM)
- Jina Embeddings v3: superseded by v4

**Current Models (2 deployed on HF Spaces):**
- all-MiniLM-L6-v2: fast, lightweight (384 dimensions)
- BGE Large EN v1.5: larger, more capable (1024 dimensions)
- Both models chosen for quality and HF Spaces compatibility

### Interface Evolution

**Original Design:**
- 5 separate tabs (Most Similar, Least Similar, Target Score, Target Range, Comparison)
- Model name shown at the bottom
- Dropdown for model selection

**Previous Design (6 models):**
- 3 tabs (Comparison, Most & Least, Target Score)
- Model dropdown at the top
- Slider for result count

**Current Design (Production on HF Spaces):**
- 2 tabs (Comparison, Most & Least Similar)
- Radio buttons for model selection
- Fixed 20 results in Most & Least Similar
- Model name shown in each result
- Streamlined, mobile-optimized interface

---
## Deployment Status (HuggingFace Spaces)

### Current Deployment ✅

**Live on HuggingFace Spaces:**
- URL: hf.co/spaces/jnalv/semantic-explorer
- 2 models deployed: all-MiniLM-L6-v2, BGE Large EN v1.5
- ChromaDB databases pre-computed and uploaded
- Models load directly from HuggingFace Hub (no local paths)
- Lazy loading implemented for optimal cold-start performance

**Deployment Configuration:**
1. `app.py` - Main application file (HF Spaces runs this automatically)
2. Models selected to fit HF Spaces storage limits
3. Embeddings pre-computed locally and uploaded
4. ChromaDB databases included in the repository

**README.md metadata (in production):**

```yaml
---
title: Semantic Word Explorer
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
---
```
### Deployment Notes

**Storage:**
- ChromaDB databases (~1GB each for the 2 models)
- Managed within HF Spaces storage limits

**Performance:**
- Models lazy-load on first selection
- Cached after first load for fast switching
- Cold start: ~30-60 seconds per model (acceptable for the free tier)

**Resource Management:**
- 2 models fit comfortably in HF Spaces memory
- Lazy loading ensures efficient resource usage

---
## Developer Notes

### Josh's Preferences

**Code style:**
- Practical over theoretical
- Working prototypes over perfect architecture
- Remove features that don't add clear value
- Streamlined, mobile-friendly interfaces
- No excessive explanation in the UI

**Development approach:**
- Iterative refinement based on testing
- Build complete prototypes to understand concepts
- Modular structure (separate projects, shared resources)
- Git hygiene with virtual environments

### Testing Workflow

```bash
# Local testing
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer

# Prepare embeddings for a model
python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model BAAI/bge-large-en-v1.5

# Launch the app
python app.py
# Access at http://localhost:7860
```
### Common Issues & Solutions

**Issue:** Model not loading
**Solution:** Check that `prepare_embeddings.py` was run for that model

**Issue:** Collection not found
**Solution:** Collection names are auto-generated; verify with ChromaDB

**Issue:** OOM during embedding generation
**Solution:** Reduce the `--batch-size` parameter

**Issue:** Jina v4 task error
**Solution:** Ensure `default_task` is set in the model registry
---
## AI Agent Instructions

When working with this codebase:

1. **Adding new models:** Update `AVAILABLE_MODELS` in app.py; add trust/task requirements if needed
2. **Modifying the UI:** Edit the Gradio components in the `create_interface()` function
3. **Changing similarity logic:** Modify the functions in app.py (`find_most_and_least_similar`, etc.)
4. **Deployment prep:** Focus on `app.py` (the deployed file) and keep it compatible with the Spaces environment
5. **Documentation:** Update the relevant .md files when making significant changes

### Code Modification Guidelines

- Preserve the lazy loading pattern (don't load all models at startup)
- Maintain the model registry structure (central source of truth)
- Keep model labels in results (user feedback requirement)
- Test with multiple models before committing
- Update requirements.txt for new dependencies
### Files You'll Most Often Edit

1. `app.py` - Main application logic (deployed to HF Spaces)
2. `prepare_embeddings.py` - Embedding generation (local only, for adding new models)
3. `requirements.txt` - Dependencies
4. `README.md` / `QUICKSTART.md` - User documentation

### Important Files

**Production:**
- `app.py` - Single app file (loads models from HuggingFace Hub)
- `chromadb/` - Pre-computed embeddings for the deployed models
- `requirements.txt` - Dependencies for HF Spaces

**Local Development:**
- `prepare_embeddings.py` - Generate embeddings locally before deployment
- `documents/english-words.txt` - Word list (~100k words)

### Files You Shouldn't Usually Edit

- `chromadb/` - Generated data
- `documents/` - Input data
- `models/` - Downloaded models
- Backup files (*.backup)

---
## Quick Reference

### Run Locally

```bash
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer
python app.py
```

### Prepare a New Model

```bash
python prepare_embeddings.py \
    --wordlist english-words.txt \
    --model <model-hf-id>
```

### Check Available Models

```bash
# Look at AVAILABLE_MODELS in app.py, or run:
python -c "import app; print(list(app.AVAILABLE_MODELS.keys()))"
```
### Test a Specific Model

Launch the app, select the model via the radio buttons, and test each tab.

---
## Related Documentation

- `QUICKSTART.md` - Getting started guide
- `README.md` - Comprehensive user documentation
- `DEPLOYMENT.md` - HuggingFace Spaces deployment (original)
- `HUGGINGFACE_SPACES_WORKFLOW.md` - Git workflow for Spaces
- `CHANGES.md` - Change history
- `NEW_MODELS.md` - Latest model additions
- `JINA_V4_TASK_FIX.md` - Jina v4 task requirement fix

---

## Version Information

**Current State:** Live on HuggingFace Spaces
**Deployment Date:** January 2025
**Last Major Update:** January 2025 - Streamlined to 2 tabs, radio buttons, 2 models
**Python Version:** 3.12
**Gradio Version:** 4.0.0+
**Deployed Models:** 2 embedding models (384 and 1024 dimensions)

---

**END OF PROJECT DOCUMENTATION**