# Semantic Explorer - Project Documentation
**Project Type:** Gradio Web Application
**Target Platform:** Hugging Face Spaces (Gradio SDK)
**Development Environment:** Ubuntu Linux with Cursor IDE
**Python Version:** 3.12
**Primary Developer:** Josh (jnalv)
---
## Project Overview
### What This App Does
Semantic Explorer is an interactive web application that allows users to explore semantic relationships between words using multiple embedding models. Users can:
1. **Compare embeddings across models** - Select from 2 different embedding models to see how they represent semantic meaning
2. **Find similar and dissimilar words** - Discover the 20 most and 20 least semantically similar words to a reference word
3. **Compare words directly** - Get similarity scores between a reference word and up to 10 comparison words
### Core Functionality
The app operates in two phases:
**Phase 1: Preparation (Local, one-time per model)**
- User runs `prepare_embeddings.py` with a word list and chosen embedding model
- Script generates embeddings for all words (typically 100k+ words)
- Embeddings stored in ChromaDB vector database
- Each model gets its own collection (e.g., `words_bge_large_en_v1_5`)
**Phase 2: Exploration (Web interface)**
- User launches Gradio app via `app.py` (deployed on HuggingFace Spaces)
- Selects embedding model from radio buttons
- Models and collections lazy-load on demand (cached after first load)
- Performs similarity searches and comparisons in real-time
### Primary Use Case
Educational and research tool for understanding how different embedding models capture semantic meaning. Allows side-by-side comparison of model behaviors when representing language.
---
## Project Goals
### Immediate Goals
1. **Hugging Face Spaces Deployment** ✅ COMPLETE
- App deployed and live on HuggingFace Spaces
- 2 embedding models integrated and tested (all-MiniLM-L6-v2, BGE Large EN v1.5)
- Models load directly from HuggingFace Hub
- ChromaDB databases pre-computed and uploaded
- Lazy loading and caching implemented for optimal performance
2. **User Experience Polish** ✅ COMPLETE
- Streamlined 2-tab interface (Comparison Tool, Most & Least Similar)
- Radio buttons for model selection
- Fixed 20 results in Most & Least Similar (no slider)
- Clear model labels in all results
- Mobile-responsive design
### Long-Term Goals
- Support for custom word lists via file upload
- Visualization of embedding spaces (t-SNE/UMAP)
- Export similarity results as CSV
- Multi-language support beyond English
- API endpoint for programmatic access
---
## Technical Architecture
### Technology Stack
**Core Dependencies:**
```
gradio>=4.0.0 # Web interface framework
chromadb>=0.4.0 # Vector database
sentence-transformers>=2.2.0 # Embedding model interface
numpy>=1.24.0 # Numerical operations
torch>=2.0.0 # PyTorch backend
einops>=0.6.0 # Tensor operations (Jina v4)
peft>=0.7.0 # LoRA adapters (Jina v4)
tqdm>=4.65.0 # Progress bars
```
**Hardware Requirements:**
- GPU: NVIDIA RTX 4060Ti 16GB VRAM (development system)
- RAM: 32GB DDR4
- Storage: ~500MB per model + ~1GB per embedded word list
### Embedding Models (Currently Deployed)
| Model | Params | Dimensions | VRAM | Special Requirements |
|-------|--------|-----------|------|---------------------|
| all-MiniLM-L6-v2 | 22M | 384 | ~0.5GB | None |
| BGE Large EN v1.5 | 335M | 1024 | ~2GB | None |
**Model-Specific Notes:**
- Both models load directly from HuggingFace Hub
- All models use sentence-transformers library for consistent interface
- Models are lazy-loaded and cached for performance
### Data Flow
```
User Input (reference word)
↓
Model Selection (radio buttons)
↓
Lazy Load Model + Collection (if not cached)
↓
Encode Reference Word
↓
Vector Similarity Search (ChromaDB or numpy)
↓
Format Results with Model Label
↓
Display in Gradio Interface
```
### File Structure
```
/home/jnalv/llm-projects/semantic-explorer/
├── app.py                    # Main Gradio interface (MOST IMPORTANT)
├── prepare_embeddings.py     # One-time embedding generation
├── requirements.txt          # Python dependencies
├── README.md                 # User-facing documentation
├── QUICKSTART.md             # Getting started guide
├── DEPLOYMENT.md             # HuggingFace Spaces deployment
├── CHANGES.md                # Changelog
├── documents/                # Word lists (input data)
│   └── english-words.txt     # ~100k English words
├── models/                   # Shared embedding models (symlink to /home/jnalv/llm-projects/models)
├── chromadb/                 # Vector databases (one per model)
│   ├── words_all_MiniLM_L6_v2/
│   ├── words_bge_large_en_v1_5/
│   ├── words_jina_embeddings_v4/
│   ├── words_Qwen3_Embedding_0_6B/
│   ├── words_embeddinggemma_300m/
│   └── words_nomic_embed_text_v1_5/
└── [documentation files]
```
---
## Key Implementation Details
### Model Registry Pattern
Models are defined in a central registry (`AVAILABLE_MODELS` dict in `app.py`):
```python
AVAILABLE_MODELS = {
"model-key": {
"name": "model-key",
"display": "Model Name: X dimensions", # Shown in model selector
"hf_id": "org/model-name", # HuggingFace model ID
"dimensions": 1024, # Embedding dimensions
"trust_remote_code": False, # Security flag
"default_task": "retrieval" # Optional: for task-specific models
}
}
```
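As a concrete illustration, registry entries for the two deployed models might look like this (the dict keys and `display` strings here are hypothetical; the `hf_id` values are the models' actual HuggingFace Hub IDs):

```python
# Hypothetical concrete entries for the two deployed models.
AVAILABLE_MODELS = {
    "all-MiniLM-L6-v2": {
        "name": "all-MiniLM-L6-v2",
        "display": "all-MiniLM-L6-v2: 384 dimensions",
        "hf_id": "sentence-transformers/all-MiniLM-L6-v2",
        "dimensions": 384,
        "trust_remote_code": False,
    },
    "bge-large-en-v1.5": {
        "name": "bge-large-en-v1.5",
        "display": "BGE Large EN v1.5: 1024 dimensions",
        "hf_id": "BAAI/bge-large-en-v1.5",
        "dimensions": 1024,
        "trust_remote_code": False,
    },
}
```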
**Adding a new model requires:**
1. Entry in `AVAILABLE_MODELS` (app.py)
2. Entry in `MODELS_REQUIRING_TRUST` if needed (prepare_embeddings.py)
3. Entry in `MODELS_REQUIRING_TASK` if needed (prepare_embeddings.py)
4. Running `prepare_embeddings.py` to create its collection
### Collection Naming Convention
Collections are auto-generated from model keys:
- Pattern: `words_{model_key}` with special chars replaced by underscores
- Example: `jinaai/jina-embeddings-v4` → `words_jina_embeddings_v4`
- Implementation: `get_collection_name()` function
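The naming rule can be sketched as a small helper (a hypothetical re-implementation of `get_collection_name()`, assuming any `org/` prefix is dropped and non-alphanumeric characters become underscores):

```python
import re

def get_collection_name(model_key: str) -> str:
    """Derive a ChromaDB collection name from a model key (illustrative sketch)."""
    base = model_key.split("/")[-1]            # drop "org/" prefix if present
    safe = re.sub(r"[^0-9A-Za-z]", "_", base)  # e.g. "." and "-" -> "_"
    return f"words_{safe}"
```

This reproduces the examples in the docs: `jinaai/jina-embeddings-v4` maps to `words_jina_embeddings_v4`, and `BAAI/bge-large-en-v1.5` to `words_bge_large_en_v1_5`.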
### Lazy Loading Strategy
Models and collections are only loaded when first selected:
```python
loaded_models = {}       # Cache: model_key -> SentenceTransformer
loaded_collections = {}  # Cache: model_key -> ChromaDB collection

def load_model_and_collection(model_key):
    if model_key in loaded_models:  # cache hit: reuse already-loaded objects
        return loaded_models[model_key], loaded_collections[model_key]
    # Cache miss: load the model from HF Hub and open its ChromaDB collection
    # (sketch: chroma_client assumed to be a chromadb.PersistentClient)
    info = AVAILABLE_MODELS[model_key]
    loaded_models[model_key] = SentenceTransformer(info["hf_id"])
    loaded_collections[model_key] = chroma_client.get_collection(
        get_collection_name(model_key)
    )
    return loaded_models[model_key], loaded_collections[model_key]
```
**Benefits:**
- Fast app startup (no models loaded initially)
- Memory efficient (only load what's used)
- Smooth model switching (cached after first load)
### Similarity Computation
**Most/Least Similar:**
- Retrieves all embeddings from collection
- Computes cosine similarity with numpy
- Sorts and returns top n / bottom n
**Target Score:**
- Finds words within tolerance of target score
- If not enough exact matches, shows "next-nearest"
- Random sampling if too many exact matches
**Comparison Tool:**
- Encodes reference + comparison words in single batch
- Computes pairwise similarities
- Sorts by similarity (descending)
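The most/least-similar step can be sketched in plain numpy (a simplified stand-in for the app's logic; function and variable names are illustrative):

```python
import numpy as np

def most_and_least_similar(ref_vec, vectors, words, n=20):
    """Rank `words` by cosine similarity to `ref_vec`; return top n and bottom n."""
    ref = ref_vec / np.linalg.norm(ref_vec)
    mat = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = mat @ ref                   # cosine similarity per word
    order = np.argsort(sims)[::-1]     # indices, most similar first
    most = [(words[i], float(sims[i])) for i in order[:n]]
    least = [(words[i], float(sims[i])) for i in order[-n:][::-1]]
    return most, least
```

The real app pulls `vectors` out of the ChromaDB collection for the selected model before ranking; the ranking itself is just this normalized dot product.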
---
## Critical Code Patterns
### Trust Remote Code Handling
Some models require custom code execution:
```python
# In model registry
"trust_remote_code": True
# In loading code
model = SentenceTransformer(
model_id,
trust_remote_code=model_info.get("trust_remote_code", False)
)
```
**Models requiring this:** Jina v4, Nomic v1.5
### Task-Specific Model Handling
Jina v4 uses LoRA adapters and requires a task:
```python
# In model registry
"default_task": "retrieval"
# In loading code
model_kwargs = {}
if "default_task" in model_info:
model_kwargs["default_task"] = model_info["default_task"]
model = SentenceTransformer(
model_id,
model_kwargs=model_kwargs if model_kwargs else None
)
```
### Gradio Tab Structure
Interface uses tabs for different modes:
```python
with gr.Blocks() as app:
model_selector = gr.Radio(choices=[...]) # Top-level selector with radio buttons
with gr.Tabs():
with gr.Tab("Comparison Tool"): # Tab 1
# ... comparison UI (up to 10 words)
with gr.Tab("Most & Least Similar"): # Tab 2
# ... shows 20 most + 20 least similar words
```
**Design rationale:**
- Comparison Tool first (most direct use case)
- Combined Most & Least (better for mobile, shows contrast)
- Target Score removed (didn't function as needed)
- Radio buttons for model selection (cleaner with only 2 models)
- Fixed 20 results in Most & Least (no slider needed)
---
## Development History
### Model Evolution
**Removed Models:**
- E5 Large v2 → "clumpy" semantic clustering
- Qwen2.5 Embedding 7B → OOM errors (14GB VRAM)
- Jina Embeddings v3 → Superseded by v4
**Current Models (2 deployed on HF Spaces):**
- all-MiniLM-L6-v2: Fast, lightweight (384 dimensions)
- BGE Large EN v1.5: Larger, more capable (1024 dimensions)
- Both models chosen for quality and HF Spaces compatibility
### Interface Evolution
**Original Design:**
- 5 separate tabs (Most Similar, Least Similar, Target Score, Target Range, Comparison)
- Model name shown at bottom
- Dropdown for model selection
**Previous Design (6 models):**
- 3 tabs (Comparison, Most & Least, Target Score)
- Model dropdown at top
- Slider for result count
**Current Design (Production on HF Spaces):**
- 2 tabs (Comparison, Most & Least Similar)
- Radio buttons for model selection
- Fixed 20 results in Most & Least Similar
- Model name shown in each result
- Streamlined, mobile-optimized interface
---
## Deployment Status (HuggingFace Spaces)
### Current Deployment ✅
**Live on HuggingFace Spaces:**
- URL: hf.co/spaces/jnalv/semantic-explorer
- 2 models deployed: all-MiniLM-L6-v2, BGE Large EN v1.5
- ChromaDB databases pre-computed and uploaded
- Models load directly from HuggingFace Hub (no local paths)
- Lazy loading implemented for optimal cold-start performance
**Deployment Configuration:**
1. `app.py` - Main application file (HF Spaces runs this automatically)
2. Models selected for HF Spaces storage limits
3. Embeddings pre-computed locally and uploaded
4. ChromaDB databases included in repository
**README.md metadata (in production):**
```yaml
---
title: Semantic Word Explorer
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
---
```
### Deployment Notes
**Storage:**
- ChromaDB databases (~1GB each for 2 models)
- Managed within HF Spaces storage limits
**Performance:**
- Models lazy-load on first selection
- Cached after first load for fast switching
- Cold start: ~30-60 seconds per model (acceptable for free tier)
**Resource Management:**
- 2 models fit comfortably in HF Spaces memory
- Lazy loading ensures efficient resource usage
---
## Developer Notes
### Josh's Preferences
**Code style:**
- Practical over theoretical
- Working prototypes over perfect architecture
- Remove features that don't add clear value
- Streamlined, mobile-friendly interfaces
- No excessive explanation in UI
**Development approach:**
- Iterative refinement based on testing
- Build complete prototypes to understand concepts
- Modular structure (separate projects, shared resources)
- Git hygiene with virtual environments
### Testing Workflow
```bash
# Local testing
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer
# Prepare embeddings for a model
python prepare_embeddings.py \
--wordlist english-words.txt \
--model BAAI/bge-large-en-v1.5
# Launch app
python app.py
# Access at http://localhost:7860
```
### Common Issues & Solutions
**Issue:** Model not loading
**Solution:** Check that `prepare_embeddings.py` was run for that model
**Issue:** Collection not found
**Solution:** Collection names auto-generated; verify with ChromaDB
**Issue:** OOM during embedding generation
**Solution:** Reduce `--batch-size` parameter
**Issue:** Jina v4 task error
**Solution:** Ensure `default_task` is set in model registry
---
## AI Agent Instructions
When working with this codebase:
1. **Adding new models:** Update `AVAILABLE_MODELS` in app.py, add trust/task requirements if needed
2. **Modifying UI:** Edit Gradio components in `create_interface()` function
3. **Changing similarity logic:** Modify functions in app.py (`find_most_and_least_similar`, etc.)
4. **Deployment prep:** `app.py` is the deployed entry point (HF Spaces runs it directly; the separate `app_hf_spaces.py` has been removed), so keep changes to it compatible with the Spaces environment
5. **Documentation:** Update relevant .md files when making significant changes
### Code Modification Guidelines
- Preserve the lazy loading pattern (don't load all models at startup)
- Maintain the model registry structure (central source of truth)
- Keep model labels in results (user feedback requirement)
- Test with multiple models before committing
- Update requirements.txt for new dependencies
### Files You'll Most Often Edit
1. `app.py` - Main application logic (deployed to HF Spaces)
2. `prepare_embeddings.py` - Embedding generation (local only, for adding new models)
3. `requirements.txt` - Dependencies
4. `README.md` / `QUICKSTART.md` - User documentation
### Important Files
**Production:**
- `app.py` - Single app file (loads models from HuggingFace Hub)
- `chromadb/` - Pre-computed embeddings for deployed models
- `requirements.txt` - Dependencies for HF Spaces
**Local Development:**
- `prepare_embeddings.py` - Generate embeddings locally before deployment
- `documents/english-words.txt` - Word list (~100k words)
### Files You Shouldn't Usually Edit
- `chromadb/` - Generated data
- `documents/` - Input data
- `models/` - Downloaded models
- Backup files (*.backup)
---
## Quick Reference
### Run Locally
```bash
source /home/jnalv/llm-projects/llm-env/bin/activate
cd /home/jnalv/llm-projects/semantic-explorer
python app.py
```
### Prepare New Model
```bash
python prepare_embeddings.py \
--wordlist english-words.txt \
--model <model-hf-id>
```
### Check Available Models
```bash
# Look at AVAILABLE_MODELS in app.py
# or run:
python -c "import app; print(list(app.AVAILABLE_MODELS.keys()))"
```
### Test Specific Model
```bash
# Launch app, select a model via the radio buttons, test each tab
```
---
## Related Documentation
- `QUICKSTART.md` - Getting started guide
- `README.md` - Comprehensive user documentation
- `DEPLOYMENT.md` - HuggingFace Spaces deployment (original)
- `HUGGINGFACE_SPACES_WORKFLOW.md` - Git workflow for Spaces
- `CHANGES.md` - Change history
- `NEW_MODELS.md` - Latest model additions
- `JINA_V4_TASK_FIX.md` - Jina v4 task requirement fix
---
## Version Information
**Current State:** Live on HuggingFace Spaces
**Deployment Date:** January 2025
**Last Major Update:** January 2025 - Streamlined to 2 tabs, radio buttons, 2 models
**Python Version:** 3.12
**Gradio Version:** 4.0.0+
**Deployed Models:** 2 embedding models (384 and 1024 dimensions)
---
**END OF PROJECT DOCUMENTATION**