James Edmunds committed
Commit · af1818d
Parent(s): 40c5feb

Security improvements: Enhanced README, added .env.example, removed docs from tracking

Files changed:
- .env.example +11 -0
- .gitignore +3 -0
- README.md +220 -4
- docs/NOTES.md +0 -102
- docs/PROJECT_README.md +0 -125
- docs/TODO.md +0 -43
- docs/TROUBLESHOOTING.md +0 -44
.env.example
ADDED

@@ -0,0 +1,11 @@
+# OpenAI API Key
+# Get your key from: https://platform.openai.com/api-keys
+OPENAI_API_KEY=your_openai_api_key_here
+
+# HuggingFace Token (optional, for dataset access)
+# Get your token from: https://huggingface.co/settings/tokens
+HF_TOKEN=your_huggingface_token_here
+
+# Deployment Mode
+# 'local' for development, 'huggingface' for HF Space
+DEPLOYMENT_MODE=local
.gitignore
CHANGED

@@ -44,6 +44,9 @@ htmlcov/
 .env.local
 TODO.txt
 
+# Documentation (keep private)
+docs/
+
 # Huggingface
 .hf/
 .huggingface/
README.md
CHANGED

@@ -9,10 +9,226 @@ app_file: app.py
 pinned: false
 ---
 
-# SongLift LyrGen2
+# SongLift LyrGen2 🎵
 
-An AI-powered
+An AI-powered lyrics generation system that uses semantic understanding of existing lyrics to generate new, contextually relevant song lyrics. Built with LangChain, RAG (Retrieval-Augmented Generation), and OpenAI's GPT-4.
 
-##
-
+## 🚀 Live Demo
+
+**[Try it on HuggingFace Spaces](https://huggingface.co/spaces/SongLift/LyrGen2)**
+
+## ✨ Features
+
+- **Semantic Lyrics Generation**: Uses vector embeddings of 234K+ lyrics for contextual understanding
+- **RAG Technology**: Retrieval-Augmented Generation finds similar lyrics to inform new creations
+- **Modern Sensibilities**: Trained on contemporary pop and hip-hop lyrics
+- **Interactive Web Interface**: Clean Streamlit interface for easy use
+- **Source Attribution**: Shows which lyrics influenced the generation
+
+## 🏗️ Architecture
+
+### Core Components
+- **Vector Database**: ChromaDB with OpenAI Ada-002 embeddings
+- **AI Models**: GPT-4 for generation, Ada-002 for embeddings
+- **Data Pipeline**: Automated processing of raw lyrics into searchable embeddings
+- **Dual Deployment**: Local development + HuggingFace Spaces production
+
+### Workflow
+```
+Raw Lyrics → Data Cleaning → Text Chunking → Embeddings → ChromaDB → Generation
+```
+
+## 🛠️ Local Development
+
+### Prerequisites
+- Python 3.8+
+- OpenAI API key
+- HuggingFace token (optional, for dataset access)
+
+### Setup
+```bash
+# Clone the repository
+git clone <your-repo-url>
+cd SongLift_LyrGen2
+
+# Create virtual environment
+python -m venv .venv
+source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Configure environment
+cp .env.example .env
+# Edit .env with your API keys
+```
+
+### Environment Variables
+Create a `.env` file with:
+```env
+OPENAI_API_KEY=your_openai_api_key_here
+HF_TOKEN=your_huggingface_token_here
+DEPLOYMENT_MODE=local
+```
+
+### Run Locally
+```bash
+streamlit run app.py
+```
+Visit `http://localhost:8501`
+
+## 🧪 Testing & Validation
+
+```bash
+# Test your environment setup
+python scripts/test_environment.py
+
+# Test OpenAI connection
+python scripts/test_openai_connection.py
+
+# Validate embeddings database
+python scripts/test_embeddings.py
+```
+
+## 📊 Data Processing
+
+The system processes lyrics through a sophisticated pipeline:
+
+1. **Raw Data Loading** (`scripts/process_lyrics.py`)
+   - Multi-encoding support (UTF-8, Latin-1, CP1252)
+   - Section detection ([Verse], [Chorus], etc.)
+   - Metadata preservation
+
+2. **Text Processing**
+   - Recursive text splitting (300 chars, 75 overlap)
+   - Batch processing with rate limiting
+   - Automatic retry on API limits
+
+3. **Vector Storage**
+   - ChromaDB collection: "lyrics_v1"
+   - ~234K embedded documents
+   - Metadata tracking (artist, song title)
+
+## 🚀 Deployment
+
+### HuggingFace Spaces
+The app auto-deploys to HuggingFace Spaces via GitHub sync:
+- **Space**: [SongLift/LyrGen2](https://huggingface.co/spaces/SongLift/LyrGen2)
+- **Dataset**: [SongLift/LyrGen2_DB](https://huggingface.co/datasets/SongLift/LyrGen2_DB)
+
+Configure secrets in HF Spaces settings:
+- `OPENAI_API_KEY`
+- `HF_TOKEN`
+
+### Local to Production Sync
+```bash
+# Process and upload embeddings
+python scripts/process_lyrics.py
+python scripts/upload_embeddings.py
+```
+
+## 🔧 Configuration
+
+Key configuration in `config/settings.py`:
+- **Models**: GPT-4 for generation, Ada-002 for embeddings
+- **Paths**: Auto-detects local vs HuggingFace environment
+- **Database**: ChromaDB with persistent storage
+
+## 📁 Project Structure
+
+```
+SongLift_LyrGen2/
+├── app.py                 # Main Streamlit application
+├── config/
+│   └── settings.py        # Central configuration
+├── src/
+│   ├── generator/         # Core generation logic
+│   └── utils/             # Utility functions
+├── scripts/               # Data processing & testing
+├── data/
+│   ├── raw/lyrics/        # Original lyrics files
+│   └── processed/         # Embeddings & processed data
+└── docs/                  # Documentation
+```
+
+## 🔍 Browser Compatibility
+⚠️ **Recommended**: Chrome or Chromium-based browsers for optimal performance. Some features may not work correctly in Safari.
+
+## 🤗 HuggingFace Spaces Setup
+
+### Deploy Your Own Space
+
+1. **Create a HuggingFace Space**:
+   - Go to [HuggingFace Spaces](https://huggingface.co/spaces)
+   - Click "Create new Space"
+   - Choose "Streamlit" as SDK
+   - Set `app_file: app.py`
+
+2. **Configure Secrets**:
+   - In your Space settings, add these secrets:
+   - `OPENAI_API_KEY`: Your OpenAI API key
+   - `HF_TOKEN`: Your HuggingFace token (for dataset access)
+
+3. **Upload Your Dataset**:
+   ```bash
+   # Process and upload embeddings to HF dataset
+   python scripts/process_lyrics.py
+   python scripts/upload_embeddings.py
+   ```
+
+4. **Sync with GitHub** (optional):
+   - Connect your Space to a GitHub repo for automatic deployments
+   - Push changes to GitHub → auto-deploys to HF Spaces
+
+### Running HuggingFace Locally
+
+You can test the HuggingFace environment locally:
+
+```bash
+# Set HuggingFace mode
+export DEPLOYMENT_MODE=huggingface
+
+# Run locally (will use HF dataset paths)
+streamlit run app.py
+```
+
+This helps debug HF-specific issues before deploying.
+
+## 🤝 Contributing
+
+1. Fork the repository
+2. Create a feature branch
+3. Make your changes
+4. Add tests if applicable
+5. Submit a pull request
+
+## 📄 License
+
+MIT License
+
+Copyright (c) 2024 SongLift
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+## 🙏 Acknowledgments
+
+- Built with [LangChain](https://langchain.com/) and [Streamlit](https://streamlit.io/)
+- Powered by [OpenAI](https://openai.com/) and [HuggingFace](https://huggingface.co/)
+- Vector storage by [ChromaDB](https://www.trychroma.com/)
 
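The README describes `config/settings.py` as auto-detecting the local vs. HuggingFace environment via `DEPLOYMENT_MODE`, but that file is not part of this commit. A hypothetical sketch of how such a switch usually looks (the `/data` mount point and all names here are assumptions, not the project's actual code):

```python
import os
from pathlib import Path

def resolve_data_dir(mode=None):
    """Pick the data root based on DEPLOYMENT_MODE ('local' or 'huggingface')."""
    mode = mode or os.environ.get("DEPLOYMENT_MODE", "local")
    if mode == "huggingface":
        # HF Spaces persistent storage is conventionally mounted at /data
        return Path("/data")
    # Local development: project-relative layout
    return Path.cwd() / "data"

# Hypothetical constant mirroring the README's directory layout
CHROMA_DIR = resolve_data_dir() / "processed" / "embeddings" / "chroma"
```

Centralising the switch in one function keeps the rest of the code path-agnostic, which is what makes the `export DEPLOYMENT_MODE=huggingface` local-testing trick above work.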
docs/NOTES.md
DELETED

@@ -1,102 +0,0 @@
-Notes:
-
-I could fine-tune an OpenAI model and access it via LangChain if I wanted to:
-
-https://python.langchain.com/docs/integrations/chat/openai/
-
-This generally takes the form of ft:{OPENAI_MODEL_NAME}:{ORG_NAME}::{MODEL_ID}. For example:
-
-fine_tuned_model = ChatOpenAI(
-    temperature=0, model_name="ft:gpt-3.5-turbo-0613:langchain::7qTVM5AR"
-)
-
-fine_tuned_model.invoke(messages)
-
-
-IMPORTANT SECTIONS:
-generator.py:
-    def setup_qa_chain(self):
-        """Initialize the QA chain for generating lyrics"""
-        retriever = self.vector_store.as_retriever(
-            search_kwargs={
-                "k": 500,  # Increased from 3
-                "fetch_k": 5000,  # Fetch more candidates before selecting top k
-                "score_threshold": 0.7  # Only use relevant documents
-            }
-
-MODEL kwarg meanings:
-Default Penalties
-Default values are 0.0 for both presence and frequency penalties. This means:
-- No additional penalty for repeated tokens
-- No additional penalty for using common tokens
-
-Effects:
-- Higher temperature: More creative/random → better for unique lyrics
-- Higher top_p: More vocabulary variety → richer language
-- Higher presence/frequency penalties: Less repetition → more unique phrases
-
-Parameter Deep Dive
-Temperature (0.0 - 2.0, default 1.0)
-- Controls randomness in token selection
-- Lower = more deterministic, focused, and conservative
-- Higher = more creative, random, and potentially chaotic
-- Think of it as "creativity vs consistency"
-Top_p (0.0 - 1.0, default 1.0)
-- Controls diversity by limiting cumulative probability mass of next tokens
-- Similar to temperature but more precise
-- Lower values = more focused vocabulary
-- Higher values = more diverse word choices
-- Often preferred over temperature for creative tasks
-Presence Penalty (-2.0 to 2.0)
-- Penalizes tokens that have appeared at all
-- Higher values encourage using new topics/concepts
-- Good for avoiding repetitive themes
-- Different from frequency penalty because it penalizes any reuse
-Frequency Penalty (-2.0 to 2.0)
-- Penalizes tokens based on how often they've appeared
-- Higher values discourage frequent word reuse
-- Good for avoiding stuck-in-a-loop repetition
-- More granular than presence penalty
-
-# Using Temperature (controls randomness)
-llm = ChatOpenAI(
-    temperature=0.7,  # More focused
-    top_p=1.0  # Default, effectively disabled
-)
-
-# Using Top_p (nucleus sampling)
-llm = ChatOpenAI(
-    temperature=1.0,  # Default
-    top_p=0.7  # More focused vocabulary
-)
-
-
-Chunk Size and Overlap
-Small Chunks (100 characters), effects:
-- More granular retrieval
-- Better for finding specific lines or phrases
-- Might lose broader context/themes
-- Could fragment verses/concepts
-- More chunks to process = potentially slower
-
-The key is finding the right balance for your specific use case. For lyrics, I'd recommend:
-- Chunk size: 200-300 characters (typical verse size)
-- Overlap: 50-75 characters (enough to maintain rhyme patterns and flow)
-This maintains musical phrases while allowing for good retrieval granularity
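The chunk-size and overlap recommendation in the deleted notes (200-300 chars with 50-75 overlap) boils down to sliding-window arithmetic. A minimal sketch of that arithmetic (the project itself uses LangChain's `RecursiveCharacterTextSplitter`, which additionally prefers splitting at natural boundaries like newlines):

```python
def chunk_text(text, size=300, overlap=75):
    """Split text into windows of `size` chars, where each window shares
    `overlap` trailing/leading chars with its neighbour (step = size - overlap)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With `size=4, overlap=2`, `"abcdefghij"` splits into `abcd / cdef / efgh / ghij`: every pair of adjacent chunks shares two characters, which is what preserves rhyme patterns and flow across chunk boundaries.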
docs/PROJECT_README.md
DELETED

@@ -1,125 +0,0 @@
-# SongLift LyrGen2 - AI Lyrics Generation System
-
-## Project Overview
-SongLift LyrGen2 is an AI-powered lyrics generation system that uses semantic understanding of existing lyrics to generate new, contextually relevant song lyrics. The system combines vector embeddings of existing lyrics with OpenAI's GPT-4 for creative generation.
-
-## Core Components
-
-### 1. Data Processing Pipeline
-- Raw lyrics stored in `data/raw/lyrics`
-- Processed into embeddings using ChromaDB
-- Embeddings stored in `data/processed/embeddings/chroma`
-
-### 2. Vector Database
-- Uses ChromaDB for vector storage
-- Collection name: "langchain"
-- Stores embeddings with metadata (artist, song title)
-- Approximately 234K embedded documents
-
-### 3. AI Models
-- Embeddings: OpenAI Ada-002
-- Generation: GPT-4
-- Vector similarity search for context
-
-## Workflow
-
-1. **Data Preparation**
-   ```bash
-   python scripts/process_lyrics.py
-   ```
-   - Loads raw lyrics
-   - Splits into chunks
-   - Creates embeddings
-   - Stores in ChromaDB
-
-2. **Deployment**
-   ```bash
-   python scripts/upload_embeddings.py
-   ```
-   - Uploads embeddings to HuggingFace dataset
-   - Used by HuggingFace Space deployment
-
-3. **Generation**
-   - Takes user prompt
-   - Finds similar lyrics
-   - Uses context for GPT-4 generation
-   - Returns generated lyrics with sources
-
-## Environment Setup
-- Local development: Uses project-relative paths
-- HuggingFace deployment: Uses persistent storage paths
-- Controlled via `DEPLOYMENT_MODE` environment variable
-
-## Key Files
-- `config/settings.py`: Central configuration
-- `src/generator/generator.py`: Core generation logic
-- `scripts/process_lyrics.py`: Data processing pipeline
-- `scripts/test_embeddings.py`: Database validation
-
-## Development Notes
-- Always use "langchain" as ChromaDB collection name
-- Embeddings are versioned in HuggingFace dataset
-- Local testing available via test scripts
-- Comprehensive logging throughout pipeline
-
-## Deployment
-The system is deployed as:
-- HuggingFace Space: SongLift/LyrGen2
-- Database: SongLift/LyrGen2_DB
-
-## Testing
-```bash
-python scripts/test_embeddings.py
-python scripts/test_semantic.py
-```
-
-For troubleshooting and known issues, see TROUBLESHOOTING.md
-
-## Data Processing & Loading Workflow
-
-### 1. Lyrics Processing (`scripts/process_lyrics.py`)
-- Handles initial lyrics processing and embedding creation
-- Features:
-  - Batch processing with rate limiting
-  - Recursive text splitting (300 chars, 75 overlap)
-  - Automatic retry on API limits
-  - Comprehensive metadata tracking
-  - Progress monitoring with tqdm
-
-### 2. Data Loading (`src/utils/data_loader.py`)
-- Manages raw lyrics loading and cleaning
-- Features:
-  - Multi-encoding support (utf-8, latin-1, cp1252)
-  - Section marker detection ([Verse], [Chorus], etc.)
-  - Intelligent line breaking
-  - Metadata preservation (artist, song title)
-  - Robust validation checks
-
-### 3. Workflow Steps
-```bash
-# 1. Load and clean lyrics
-python scripts/process_lyrics.py
-# Creates cleaned, chunked documents with metadata
-
-# 2. Generate embeddings
-# (Handled automatically by process_lyrics.py)
-# Stores in data/processed/embeddings/chroma
-
-# 3. Upload to HuggingFace
-python scripts/upload_embeddings.py
-# Verifies and uploads to SongLift/LyrGen2_DB
-```
-
-### Data Flow
-```
-Raw Lyrics (txt)
-      ↓
-Data Loader (cleaning)
-      ↓
-Text Splitter (chunking)
-      ↓
-OpenAI Embeddings
-      ↓
-ChromaDB Storage
-      ↓
-HuggingFace Dataset
-```
-
-Each step includes validation and error handling to ensure data integrity. The process is designed to be resumable and maintains consistency between local and deployed environments.
docs/TODO.md
DELETED

@@ -1,43 +0,0 @@
-merge cleanstart2 to main
-remove TODO.txt from github repo
-
-
-Figure out which LLM to use instead of OpenAI
-Test kwargs, temperature, and max_tokens
-Figure out if it's already chunking lyrics or if I need to do that
-Figure out best chunk size and overlap (set in generator.py)
-Make it use existing vector store unless I change it
-Make sure Billboard #1s are in the lyrics
-Use lyrics_csv_to_txt workspace in Cursor. I used this to make the billboards.csv into separate files.
-
-Filter out the first line of each text file if it looks like this:
-ContributorsTranslationsDeutschEspañolPortuguêsFrançaisفارسیNederlands한국어DanskItalianoРусскийbreak
-
-
-Maybe add the artist name and genre to the lyric files so that the LLM can use that information to write better lyrics.
-
-Add temperature and top_p sliders in Streamlit
-
-TODO: Consider replacing OpenAI embeddings with local model
-- Pros:
-  - No API dependency for vector search
-  - Reduced costs
-  - Faster response times (no API latency)
-- Cons:
-  - Need to validate quality against current OpenAI embeddings
-  - May require more compute resources
-  - Need to ensure consistent vector space with existing embeddings
-
-Potential options:
-1. HuggingFace sentence-transformers
-2. all-MiniLM-L6-v2
-3. all-mpnet-base-v2
-
-Next steps:
-1. Research embedding model options
-2. Test quality against OpenAI embeddings
-3. Benchmark performance
-4. Plan migration strategy
docs/TROUBLESHOOTING.md
DELETED

@@ -1,44 +0,0 @@
-# Troubleshooting Guide
-
-## Embeddings Issues
-
-### Empty Chroma Collection (0 Documents)
-**Symptom:**
-- ChromaDB shows 0 documents despite large files being present
-- SQLite database shows records (e.g., embeddings: 233998 records)
-- Files exist and have expected sizes:
-  - chroma.sqlite3 (~576 MB)
-  - data_level0.bin (~1.3 GB)
-
-**Cause:**
-Collection name mismatch between processing and loading. The system uses two collections:
-- "langchain" (contains the data)
-- "lyrics" (empty)
-
-**Solution:**
-Always use "langchain" as the collection name in all operations:
-```python
-vector_store = Chroma(
-    persist_directory=str(chroma_dir),
-    embedding_function=embeddings,
-    collection_name="langchain"  # Must be "langchain"
-)
-```
-
-**Verification:**
-Run the test script to check collections:
-
-```bash
-python scripts/test_embeddings.py
-```
-
-Expected output:
-```
-Collection names: [Collection(name=langchain), Collection(name=lyrics)]
-Collection count: 233998  # For langchain collection
-```
-
-**Files to Check:**
-1. config/settings.py: CHROMA_COLLECTION_NAME
-2. src/generator/generator.py: vector_store initialization
-3. scripts/process_lyrics.py: Chroma.from_documents() call