# Semantic Explorer - Project Documentation
**Project Type:** Gradio Web Application
**Target Platform:** Hugging Face Spaces (Gradio SDK)
**Development Environment:** Ubuntu Linux with Cursor IDE
**Python Version:** 3.12
**Primary Developer:** Josh (jnalv)
---
## Project Overview
### What This App Does
Semantic Explorer is an interactive web application that allows users to explore semantic relationships between words using multiple embedding models. Users can:
1. **Compare embeddings across models** - Select from 2 different embedding models to see how they represent semantic meaning
2. **Find similar and dissimilar words** - Discover the 20 most and 20 least semantically similar words to a reference word
3. **Compare words directly** - Get similarity scores between a reference word and up to 10 comparison words
### Core Functionality
The app operates in two phases:
**Phase 1: Preparation (Local, one-time per model)**
- User runs `prepare_embeddings.py` with a word list and chosen embedding model
- Script generates embeddings for all words (typically 100k+ words)
- Embeddings stored in ChromaDB vector database
- Each model gets its own collection (e.g., `words_bge_large_en_v1_5`)
**Phase 2: Exploration (Web interface)**
- User launches Gradio app via `app.py` (deployed on HuggingFace Spaces)
- Selects embedding model from radio buttons
- Models and collections lazy-load on demand (cached after first load)
- Performs similarity searches and comparisons in real-time
### Primary Use Case
Educational and research tool for understanding how different embedding models capture semantic meaning. Allows side-by-side comparison of model behaviors when representing language.
---
## Project Goals
### Immediate Goals
1. **Hugging Face Spaces Deployment** ✅ COMPLETE
- App deployed and live on HuggingFace Spaces
- 2 embedding models integrated and tested (all-MiniLM-L6-v2, BGE Large EN v1.5)
- Models load directly from HuggingFace Hub
- ChromaDB databases pre-computed and uploaded
- Lazy loading and caching implemented for optimal performance
2. **User Experience Polish** ✅ COMPLETE
- Streamlined 2-tab interface (Comparison Tool, Most & Least Similar)
- Radio buttons for model selection
- Fixed 20 results in Most & Least Similar (no slider)
- Clear model labels in all results
- Mobile-responsive design
### Long-Term Goals
- Support for custom word lists via file upload
- Visualization of embedding spaces (t-SNE/UMAP)
- Export similarity results as CSV
- Multi-language support beyond English
- API endpoint for programmatic access
---
## Technical Architecture
### Technology Stack
**Core Dependencies:**
```
gradio>=4.0.0 # Web interface framework
chromadb>=0.4.0 # Vector database
sentence-transformers>=2.2.0 # Embedding model interface
numpy>=1.24.0 # Numerical operations
torch>=2.0.0 # PyTorch backend
einops>=0.6.0 # Tensor operations (Jina v4)
peft>=0.7.0 # LoRA adapters (Jina v4)
tqdm>=4.65.0 # Progress bars
```
**Hardware Requirements:**
- GPU: NVIDIA RTX 4060Ti 16GB VRAM (development system)
- RAM: 32GB DDR4
- Storage: ~500MB per model + ~1GB per embedded word list
### Embedding Models (Currently Deployed)
| Model | Params | Dimensions | VRAM | Special Requirements |
|-------|--------|-----------|------|---------------------|
| all-MiniLM-L6-v2 | 22M | 384 | ~0.5GB | None |
| BGE Large EN v1.5 | 335M | 1024 | ~2GB | None |
**Model-Specific Notes:**
- Both models load directly from HuggingFace Hub
- All models use sentence-transformers library for consistent interface
- Models are lazy-loaded and cached for performance
### Data Flow
```
User Input (reference word)
↓
Model Selection (radio buttons)
↓
Lazy Load Model + Collection (if not cached)
↓
Encode Reference Word
↓
Vector Similarity Search (ChromaDB or numpy)
↓
Format Results with Model Label
↓
Display in Gradio Interface
```
### File Structure
```
/home/jnalv/llm-projects/semantic-explorer/
├── app.py                    # Main Gradio interface (MOST IMPORTANT)
├── prepare_embeddings.py     # One-time embedding generation
├── requirements.txt          # Python dependencies
├── README.md                 # User-facing documentation
├── QUICKSTART.md             # Getting started guide
├── DEPLOYMENT.md             # HuggingFace Spaces deployment
├── CHANGES.md                # Changelog
├── documents/                # Word lists (input data)
│   └── english-words.txt     # ~100k English words
├── models/                   # Shared embedding models (symlink to /home/jnalv/llm-projects/models)
├── chromadb/                 # Vector databases (one per model)
│   ├── words_all_MiniLM_L6_v2/
│   ├── words_bge_large_en_v1_5/
│   ├── words_jina_embeddings_v4/
│   ├── words_Qwen3_Embedding_0_6B/
│   ├── words_embeddinggemma_300m/
│   └── words_nomic_embed_text_v1_5/
└── [documentation files]
```
---
## Key Implementation Details
### Model Registry Pattern
Models are defined in a central registry (`AVAILABLE_MODELS` dict in `app.py`):
```python
AVAILABLE_MODELS = {
"model-key": {
"name": "model-key",
"display": "Model Name: X dimensions", # Shown in model selector
"hf_id": "org/model-name", # HuggingFace model ID
"dimensions": 1024, # Embedding dimensions
"trust_remote_code": False, # Security flag
"default_task": "retrieval" # Optional: for task-specific models
}
}
```
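As a concrete illustration, registry entries for the two deployed models might look like this (the dict keys and `display` strings here are hypothetical; the `hf_id` values are the models' actual HuggingFace Hub IDs):

```python
# Hypothetical concrete entries for the two deployed models.
AVAILABLE_MODELS = {
    "all-MiniLM-L6-v2": {
        "name": "all-MiniLM-L6-v2",
        "display": "all-MiniLM-L6-v2: 384 dimensions",
        "hf_id": "sentence-transformers/all-MiniLM-L6-v2",
        "dimensions": 384,
        "trust_remote_code": False,
    },
    "bge-large-en-v1.5": {
        "name": "bge-large-en-v1.5",
        "display": "BGE Large EN v1.5: 1024 dimensions",
        "hf_id": "BAAI/bge-large-en-v1.5",
        "dimensions": 1024,
        "trust_remote_code": False,
    },
}
```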
**Adding a new model requires:**
1. Entry in `AVAILABLE_MODELS` (app.py)
2. Entry in `MODELS_REQUIRING_TRUST` if needed (prepare_embeddings.py)
3. Entry in `MODELS_REQUIRING_TASK` if needed (prepare_embeddings.py)
4. Running `prepare_embeddings.py` to create its collection
### Collection Naming Convention
Collections are auto-generated from model keys:
- Pattern: `words_{model_key}` with special chars replaced by underscores
- Example: `jinaai/jina-embeddings-v4` → `words_jina_embeddings_v4`
- Implementation: `get_collection_name()` function
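The naming rule can be sketched as a small helper (a hypothetical re-implementation of `get_collection_name()`, assuming any `org/` prefix is dropped and non-alphanumeric characters become underscores):

```python
import re

def get_collection_name(model_key: str) -> str:
    """Derive a ChromaDB collection name from a model key (illustrative sketch)."""
    base = model_key.split("/")[-1]            # drop "org/" prefix if present
    safe = re.sub(r"[^0-9A-Za-z]", "_", base)  # e.g. "." and "-" -> "_"
    return f"words_{safe}"
```

This reproduces the examples in the docs: `jinaai/jina-embeddings-v4` maps to `words_jina_embeddings_v4`, and `BAAI/bge-large-en-v1.5` to `words_bge_large_en_v1_5`.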
### Lazy Loading Strategy
Models and collections are only loaded when first selected:
```python
loaded_models = {}       # Cache: model_key -> SentenceTransformer
loaded_collections = {}  # Cache: model_key -> ChromaDB collection

def load_model_and_collection(model_key):
    if model_key in loaded_models:  # cache hit: reuse already-loaded objects
        return loaded_models[model_key], loaded_collections[model_key]
    # Cache miss: load the model from HF Hub and open its ChromaDB collection
    # (sketch: chroma_client assumed to be a chromadb.PersistentClient)
    info = AVAILABLE_MODELS[model_key]
    loaded_models[model_key] = SentenceTransformer(info["hf_id"])
    loaded_collections[model_key] = chroma_client.get_collection(
        get_collection_name(model_key)
    )
    return loaded_models[model_key], loaded_collections[model_key]
```
**Benefits:**
- Fast app startup (no models loaded initially)
- Memory efficient (only load what's used)
- Smooth model switching (cached after first load)
### Similarity Computation
**Most/Least Similar:**
- Retrieves all embeddings from collection
- Computes cosine similarity with numpy
- Sorts and returns top n / bottom n
**Target Score:**
- Finds words within tolerance of target score
- If not enough exact matches, shows "next-nearest"
- Random sampling if too many exact matches
**Comparison Tool:**
- Encodes reference + comparison words in single batch
- Computes pairwise similarities
- Sorts by similarity (descending)
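The most/least-similar step can be sketched in plain numpy (a simplified stand-in for the app's logic; function and variable names are illustrative):

```python
import numpy as np

def most_and_least_similar(ref_vec, vectors, words, n=20):
    """Rank `words` by cosine similarity to `ref_vec`; return top n and bottom n."""
    ref = ref_vec / np.linalg.norm(ref_vec)
    mat = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = mat @ ref                   # cosine similarity per word
    order = np.argsort(sims)[::-1]     # indices, most similar first
    most = [(words[i], float(sims[i])) for i in order[:n]]
    least = [(words[i], float(sims[i])) for i in order[-n:][::-1]]
    return most, least
```

The real app pulls `vectors` out of the ChromaDB collection for the selected model before ranking; the ranking itself is just this normalized dot product.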
---
## Critical Code Patterns
### Trust Remote Code Handling
Some models require custom code execution:
```python
# In model registry
"trust_remote_code": True
# In loading code
model = SentenceTransformer(
model_id,
trust_remote_code=model_info.get("trust_remote_code", False)
)
```
**Models requiring this:** Jina v4, Nomic v1.5
### Task-Specific Model Handling
Jina v4 uses LoRA adapters and requires a task:
```python
# In model registry
"default_task": "retrieval"
# In loading code
model_kwargs = {}
if "default_task" in model_info:
model_kwargs["default_task"] = model_info["default_task"]
model = SentenceTransformer(
model_id,
model_kwargs=model_kwargs if model_kwargs else None
)
```
### Gradio Tab Structure
Interface uses tabs for different modes:
```python
with gr.Blocks() as app:
model_selector = gr.Radio(choices=[...]) # Top-level selector with radio buttons
with gr.Tabs():
with gr.Tab("Comparison Tool"): # Tab 1
# ... comparison UI (up to 10 words)
with gr.Tab("Most & Least Similar"): # Tab 2
# ... shows 20 most + 20 least similar words
```
**Design rationale:**
- Comparison Tool first (most direct use case)
- Combined Most & Least (better for mobile, shows contrast)
- Target Score removed (didn't function as needed)
- Radio buttons for model selection (cleaner with only 2 models)
- Fixed 20 results in Most & Least (no slider needed)
---
## Development History
### Model Evolution
**Removed Models:**
- E5 Large v2 → "clumpy" semantic clustering
- Qwen2.5 Embedding 7B → OOM errors (14GB VRAM)
- Jina Embeddings v3 → Superseded by v4
**Current Models (2 deployed on HF Spaces):**
- all-MiniLM-L6-v2: Fast, lightweight (384 dimensions)
- BGE Large EN v1.5: Larger, more capable (1024 dimensions)
- Both models chosen for quality and HF Spaces compatibility
### Interface Evolution
**Original Design:**
- 5 separate tabs (Most Similar, Least Similar, Target Score, Target Range, Comparison)
- Model name shown at bottom
- Dropdown for model selection
**Previous Design (6 models):**
- 3 tabs (Comparison, Most & Least, Target Score)
- Model dropdown at top
- Slider for result count
**Current Design (Production on HF Spaces):**
- 2 tabs (Comparison, Most & Least Similar)
- Radio buttons for model selection
- Fixed 20 results in Most & Least Similar
- Model name shown in each result
- Streamlined, mobile-optimized interface
---
## Deployment Status (HuggingFace Spaces)
### Current Deployment ✅
**Live on HuggingFace Spaces:**
- URL: hf.co/spaces/jnalv/semantic-explorer
- 2 models deployed: all-MiniLM-L6-v2, BGE Large EN v1.5
- ChromaDB databases pre-computed and uploaded
- Models load directly from HuggingFace Hub (no local paths)
- Lazy loading implemented for optimal cold-start performance
**Deployment Configuration:**
1. `app.py` - Main application file (HF Spaces runs this automatically)
2. Models selected for HF Spaces storage limits
3. Embeddings pre-computed locally and uploaded
4. ChromaDB databases included in repository
**README.md metadata (in production):**
```yaml
---
title: Semantic Word Explorer
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
---
```
### Deployment Notes
**Storage:**
- ChromaDB databases (~1GB each for 2 models)
- Managed within HF Spaces storage limits
**Performance:**
- Models lazy-load on first selection
- Cached after first load for fast switching
- Cold start: ~30-60 seconds per model (acceptable for free tier)
**Resource Management:**
- 2 models fit comfortably in HF Spaces memory
- Lazy loading ensures efficient resource usage
---
## Developer Notes
### Josh's Preferences
**Code style:**
- Practical over theoretical
- Working prototypes over perfect architecture
- Remove features that don't add clear value
- Streamlined, mobile-friendly interfaces
- No excessive explanation in UI
**Development approach:**
- Iterative refinement based on testing
- Build complete prototypes to understand concepts
- Modular structure (separate projects, shared resources)
- Git hygiene with virtual environments
### Testing Workflow
```bash
# Local testing
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer
# Prepare embeddings for a model
python prepare_embeddings.py \
--wordlist english-words.txt \
--model BAAI/bge-large-en-v1.5
# Launch app
python app.py
# Access at http://localhost:7860
```
### Common Issues & Solutions
**Issue:** Model not loading
**Solution:** Check that `prepare_embeddings.py` was run for that model
**Issue:** Collection not found
**Solution:** Collection names auto-generated; verify with ChromaDB
**Issue:** OOM during embedding generation
**Solution:** Reduce `--batch-size` parameter
**Issue:** Jina v4 task error
**Solution:** Ensure `default_task` is set in model registry
---
## AI Agent Instructions
When working with this codebase:
1. **Adding new models:** Update `AVAILABLE_MODELS` in app.py, add trust/task requirements if needed
2. **Modifying UI:** Edit Gradio components in `create_interface()` function
3. **Changing similarity logic:** Modify functions in app.py (`find_most_and_least_similar`, etc.)
4. **Deployment prep:** `app.py` is the deployed entry point (HF Spaces runs it directly; the separate `app_hf_spaces.py` has been removed), so keep changes to it compatible with the Spaces environment
5. **Documentation:** Update relevant .md files when making significant changes
### Code Modification Guidelines
- Preserve the lazy loading pattern (don't load all models at startup)
- Maintain the model registry structure (central source of truth)
- Keep model labels in results (user feedback requirement)
- Test with multiple models before committing
- Update requirements.txt for new dependencies
### Files You'll Most Often Edit
1. `app.py` - Main application logic (deployed to HF Spaces)
2. `prepare_embeddings.py` - Embedding generation (local only, for adding new models)
3. `requirements.txt` - Dependencies
4. `README.md` / `QUICKSTART.md` - User documentation
### Important Files
**Production:**
- `app.py` - Single app file (loads models from HuggingFace Hub)
- `chromadb/` - Pre-computed embeddings for deployed models
- `requirements.txt` - Dependencies for HF Spaces
**Local Development:**
- `prepare_embeddings.py` - Generate embeddings locally before deployment
- `documents/english-words.txt` - Word list (~100k words)
### Files You Shouldn't Usually Edit
- `chromadb/` - Generated data
- `documents/` - Input data
- `models/` - Downloaded models
- Backup files (*.backup)
---
## Quick Reference
### Run Locally
```bash
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer
python app.py
```
### Prepare New Model
```bash
python prepare_embeddings.py \
--wordlist english-words.txt \
--model <model-hf-id>
```
### Check Available Models
```bash
# Look at AVAILABLE_MODELS in app.py
# or run:
python -c "import app; print(list(app.AVAILABLE_MODELS.keys()))"
```
### Test Specific Model
```bash
# Launch app, select a model via the radio buttons, test each tab
```
---
## Related Documentation
- `QUICKSTART.md` - Getting started guide
- `README.md` - Comprehensive user documentation
- `DEPLOYMENT.md` - HuggingFace Spaces deployment (original)
- `HUGGINGFACE_SPACES_WORKFLOW.md` - Git workflow for Spaces
- `CHANGES.md` - Change history
- `NEW_MODELS.md` - Latest model additions
- `JINA_V4_TASK_FIX.md` - Jina v4 task requirement fix
---
## Version Information
**Current State:** Live on HuggingFace Spaces
**Deployment Date:** January 2025
**Last Major Update:** January 2025 - Streamlined to 2 tabs, radio buttons, 2 models
**Python Version:** 3.12
**Gradio Version:** 4.0.0+
**Deployed Models:** 2 embedding models (384 and 1024 dimensions)
---
**END OF PROJECT DOCUMENTATION**