James Edmunds committed
Commit · af1818d
Parent(s): 40c5feb

Security improvements: Enhanced README, added .env.example, removed docs from tracking

Files changed:
- .env.example +11 -0
- .gitignore +3 -0
- README.md +220 -4
- docs/NOTES.md +0 -102
- docs/PROJECT_README.md +0 -125
- docs/TODO.md +0 -43
- docs/TROUBLESHOOTING.md +0 -44
.env.example
ADDED

@@ -0,0 +1,11 @@
+# OpenAI API Key
+# Get your key from: https://platform.openai.com/api-keys
+OPENAI_API_KEY=your_openai_api_key_here
+
+# HuggingFace Token (optional, for dataset access)
+# Get your token from: https://huggingface.co/settings/tokens
+HF_TOKEN=your_huggingface_token_here
+
+# Deployment Mode
+# 'local' for development, 'huggingface' for HF Space
+DEPLOYMENT_MODE=local
.gitignore
CHANGED

@@ -44,6 +44,9 @@ htmlcov/
 .env.local
 TODO.txt
 
+# Documentation (keep private)
+docs/
+
 # Huggingface
 .hf/
 .huggingface/
README.md
CHANGED

@@ -9,10 +9,226 @@ app_file: app.py
 pinned: false
 ---
 
-# SongLift LyrGen2
+# SongLift LyrGen2 🎵
 
-An AI-powered
+An AI-powered lyrics generation system that uses semantic understanding of existing lyrics to generate new, contextually relevant song lyrics. Built with LangChain, RAG (Retrieval-Augmented Generation), and OpenAI's GPT-4.
 
-##
-
+## 🚀 Live Demo
+
+**[Try it on HuggingFace Spaces](https://huggingface.co/spaces/SongLift/LyrGen2)**
+
+## ✨ Features
+
+- **Semantic Lyrics Generation**: Uses vector embeddings of 234K+ lyrics for contextual understanding
+- **RAG Technology**: Retrieval-Augmented Generation finds similar lyrics to inform new creations
+- **Modern Sensibilities**: Trained on contemporary pop and hip-hop lyrics
+- **Interactive Web Interface**: Clean Streamlit interface for easy use
+- **Source Attribution**: Shows which lyrics influenced the generation
+
+## 🏗️ Architecture
+
+### Core Components
+- **Vector Database**: ChromaDB with OpenAI Ada-002 embeddings
+- **AI Models**: GPT-4 for generation, Ada-002 for embeddings
+- **Data Pipeline**: Automated processing of raw lyrics into searchable embeddings
+- **Dual Deployment**: Local development + HuggingFace Spaces production
+
+### Workflow
+```
+Raw Lyrics → Data Cleaning → Text Chunking → Embeddings → ChromaDB → Generation
+```
+
+## 🛠️ Local Development
+
+### Prerequisites
+- Python 3.8+
+- OpenAI API key
+- HuggingFace token (optional, for dataset access)
+
+### Setup
+```bash
+# Clone the repository
+git clone <your-repo-url>
+cd SongLift_LyrGen2
+
+# Create virtual environment
+python -m venv .venv
+source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Configure environment
+cp .env.example .env
+# Edit .env with your API keys
+```
+
+### Environment Variables
+Create a `.env` file with:
+```env
+OPENAI_API_KEY=your_openai_api_key_here
+HF_TOKEN=your_huggingface_token_here
+DEPLOYMENT_MODE=local
+```
+
+### Run Locally
+```bash
+streamlit run app.py
+```
+Visit `http://localhost:8501`
+
+## 🧪 Testing & Validation
+
+```bash
+# Test your environment setup
+python scripts/test_environment.py
+
+# Test OpenAI connection
+python scripts/test_openai_connection.py
+
+# Validate embeddings database
+python scripts/test_embeddings.py
+```
+
+## 📊 Data Processing
+
+The system processes lyrics through a sophisticated pipeline:
+
+1. **Raw Data Loading** (`scripts/process_lyrics.py`)
+   - Multi-encoding support (UTF-8, Latin-1, CP1252)
+   - Section detection ([Verse], [Chorus], etc.)
+   - Metadata preservation
+
+2. **Text Processing**
+   - Recursive text splitting (300 chars, 75 overlap)
+   - Batch processing with rate limiting
+   - Automatic retry on API limits
+
+3. **Vector Storage**
+   - ChromaDB collection: "lyrics_v1"
+   - ~234K embedded documents
+   - Metadata tracking (artist, song title)
+
+## 🚀 Deployment
+
+### HuggingFace Spaces
+The app auto-deploys to HuggingFace Spaces via GitHub sync:
+- **Space**: [SongLift/LyrGen2](https://huggingface.co/spaces/SongLift/LyrGen2)
+- **Dataset**: [SongLift/LyrGen2_DB](https://huggingface.co/datasets/SongLift/LyrGen2_DB)
+
+Configure secrets in HF Spaces settings:
+- `OPENAI_API_KEY`
+- `HF_TOKEN`
+
+### Local to Production Sync
+```bash
+# Process and upload embeddings
+python scripts/process_lyrics.py
+python scripts/upload_embeddings.py
+```
+
+## 🔧 Configuration
+
+Key configuration in `config/settings.py`:
+- **Models**: GPT-4 for generation, Ada-002 for embeddings
+- **Paths**: Auto-detects local vs HuggingFace environment
+- **Database**: ChromaDB with persistent storage
+
+## 📁 Project Structure
+
+```
+SongLift_LyrGen2/
+├── app.py                 # Main Streamlit application
+├── config/
+│   └── settings.py        # Central configuration
+├── src/
+│   ├── generator/         # Core generation logic
+│   └── utils/             # Utility functions
+├── scripts/               # Data processing & testing
+├── data/
+│   ├── raw/lyrics/        # Original lyrics files
+│   └── processed/         # Embeddings & processed data
+└── docs/                  # Documentation
+```
+
+## 🔍 Browser Compatibility
+⚠️ **Recommended**: Chrome or Chromium-based browsers for optimal performance. Some features may not work correctly in Safari.
+
+## 🤗 HuggingFace Spaces Setup
+
+### Deploy Your Own Space
+
+1. **Create a HuggingFace Space**:
+   - Go to [HuggingFace Spaces](https://huggingface.co/spaces)
+   - Click "Create new Space"
+   - Choose "Streamlit" as SDK
+   - Set `app_file: app.py`
+
+2. **Configure Secrets**:
+   - In your Space settings, add these secrets:
+   - `OPENAI_API_KEY`: Your OpenAI API key
+   - `HF_TOKEN`: Your HuggingFace token (for dataset access)
+
+3. **Upload Your Dataset**:
+   ```bash
+   # Process and upload embeddings to HF dataset
+   python scripts/process_lyrics.py
+   python scripts/upload_embeddings.py
+   ```
+
+4. **Sync with GitHub** (optional):
+   - Connect your Space to a GitHub repo for automatic deployments
+   - Push changes to GitHub → auto-deploys to HF Spaces
+
+### Running HuggingFace Locally
+
+You can test the HuggingFace environment locally:
+
+```bash
+# Set HuggingFace mode
+export DEPLOYMENT_MODE=huggingface
+
+# Run locally (will use HF dataset paths)
+streamlit run app.py
+```
+
+This helps debug HF-specific issues before deploying.
+
+## 🤝 Contributing
+
+1. Fork the repository
+2. Create a feature branch
+3. Make your changes
+4. Add tests if applicable
+5. Submit a pull request
+
+## 📄 License
+
+MIT License
+
+Copyright (c) 2024 SongLift
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+## 🙏 Acknowledgments
+
+- Built with [LangChain](https://langchain.com/) and [Streamlit](https://streamlit.io/)
+- Powered by [OpenAI](https://openai.com/) and [HuggingFace](https://huggingface.co/)
+- Vector storage by [ChromaDB](https://www.trychroma.com/)
 
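The README describes `config/settings.py` as auto-detecting the local vs. HuggingFace environment via `DEPLOYMENT_MODE`, but that file is not part of this commit. A hypothetical sketch of how such a switch usually looks (the `/data` mount point and all names here are assumptions, not the project's actual code):

```python
import os
from pathlib import Path

def resolve_data_dir(mode=None):
    """Pick the data root based on DEPLOYMENT_MODE ('local' or 'huggingface')."""
    mode = mode or os.environ.get("DEPLOYMENT_MODE", "local")
    if mode == "huggingface":
        # HF Spaces persistent storage is conventionally mounted at /data
        return Path("/data")
    # Local development: project-relative layout
    return Path.cwd() / "data"

# Hypothetical constant mirroring the README's directory layout
CHROMA_DIR = resolve_data_dir() / "processed" / "embeddings" / "chroma"
```

Centralising the switch in one function keeps the rest of the code path-agnostic, which is what makes the `export DEPLOYMENT_MODE=huggingface` local-testing trick above work.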
docs/NOTES.md
DELETED

@@ -1,102 +0,0 @@
-Notes:
-
-I could fine-tune an OpenAI model and access it via LangChain if I wanted to:
-
-https://python.langchain.com/docs/integrations/chat/openai/
-
-This generally takes the form of ft:{OPENAI_MODEL_NAME}:{ORG_NAME}::{MODEL_ID}. For example:
-
-fine_tuned_model = ChatOpenAI(
-    temperature=0, model_name="ft:gpt-3.5-turbo-0613:langchain::7qTVM5AR"
-)
-
-fine_tuned_model.invoke(messages)
-
-
-IMPORTANT SECTIONS:
-generator.py:
-    def setup_qa_chain(self):
-        """Initialize the QA chain for generating lyrics"""
-        retriever = self.vector_store.as_retriever(
-            search_kwargs={
-                "k": 500,  # Increased from 3
-                "fetch_k": 5000,  # Fetch more candidates before selecting top k
-                "score_threshold": 0.7  # Only use relevant documents
-            }
-
-MODEL kwarg meanings:
-Default Penalties
-Default values are 0.0 for both presence and frequency penalties. This means:
-- No additional penalty for repeated tokens
-- No additional penalty for using common tokens
-
-Effects:
-- Higher temperature: More creative/random → better for unique lyrics
-- Higher top_p: More vocabulary variety → richer language
-- Higher presence/frequency penalties: Less repetition → more unique phrases
-
-Parameter Deep Dive
-Temperature (0.0 - 2.0, default 1.0)
-- Controls randomness in token selection
-- Lower = more deterministic, focused, and conservative
-- Higher = more creative, random, and potentially chaotic
-- Think of it as "creativity vs consistency"
-Top_p (0.0 - 1.0, default 1.0)
-- Controls diversity by limiting cumulative probability mass of next tokens
-- Similar to temperature but more precise
-- Lower values = more focused vocabulary
-- Higher values = more diverse word choices
-- Often preferred over temperature for creative tasks
-Presence Penalty (-2.0 to 2.0)
-- Penalizes tokens that have appeared at all
-- Higher values encourage using new topics/concepts
-- Good for avoiding repetitive themes
-- Different from frequency penalty because it penalizes any reuse
-Frequency Penalty (-2.0 to 2.0)
-- Penalizes tokens based on how often they've appeared
-- Higher values discourage frequent word reuse
-- Good for avoiding stuck-in-a-loop repetition
-- More granular than presence penalty
-
-# Using Temperature (controls randomness)
-llm = ChatOpenAI(
-    temperature=0.7,  # More focused
-    top_p=1.0  # Default, effectively disabled
-)
-
-# Using Top_p (nucleus sampling)
-llm = ChatOpenAI(
-    temperature=1.0,  # Default
-    top_p=0.7  # More focused vocabulary
-)
-
-
-Chunk Size and Overlap
-Small Chunks (100 characters), effects:
-- More granular retrieval
-- Better for finding specific lines or phrases
-- Might lose broader context/themes
-- Could fragment verses/concepts
-- More chunks to process = potentially slower
-
-The key is finding the right balance for your specific use case. For lyrics, I'd recommend:
-- Chunk size: 200-300 characters (typical verse size)
-- Overlap: 50-75 characters (enough to maintain rhyme patterns and flow)
-This maintains musical phrases while allowing for good retrieval granularity
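The chunk-size and overlap recommendation in the deleted notes (200-300 chars with 50-75 overlap) boils down to sliding-window arithmetic. A minimal sketch of that arithmetic (the project itself uses LangChain's `RecursiveCharacterTextSplitter`, which additionally prefers splitting at natural boundaries like newlines):

```python
def chunk_text(text, size=300, overlap=75):
    """Split text into windows of `size` chars, where each window shares
    `overlap` trailing/leading chars with its neighbour (step = size - overlap)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With `size=4, overlap=2`, `"abcdefghij"` splits into `abcd / cdef / efgh / ghij`: every pair of adjacent chunks shares two characters, which is what preserves rhyme patterns and flow across chunk boundaries.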
docs/PROJECT_README.md
DELETED

@@ -1,125 +0,0 @@
-# SongLift LyrGen2 - AI Lyrics Generation System
-
-## Project Overview
-SongLift LyrGen2 is an AI-powered lyrics generation system that uses semantic understanding of existing lyrics to generate new, contextually relevant song lyrics. The system combines vector embeddings of existing lyrics with OpenAI's GPT-4 for creative generation.
-
-## Core Components
-
-### 1. Data Processing Pipeline
-- Raw lyrics stored in `data/raw/lyrics`
-- Processed into embeddings using ChromaDB
-- Embeddings stored in `data/processed/embeddings/chroma`
-
-### 2. Vector Database
-- Uses ChromaDB for vector storage
-- Collection name: "langchain"
-- Stores embeddings with metadata (artist, song title)
-- Approximately 234K embedded documents
-
-### 3. AI Models
-- Embeddings: OpenAI Ada-002
-- Generation: GPT-4
-- Vector similarity search for context
-
-## Workflow
-
-1. **Data Preparation**
-   ```bash
-   python scripts/process_lyrics.py
-   ```
-   - Loads raw lyrics
-   - Splits into chunks
-   - Creates embeddings
-   - Stores in ChromaDB
-
-2. **Deployment**
-   ```bash
-   python scripts/upload_embeddings.py
-   ```
-   - Uploads embeddings to HuggingFace dataset
-   - Used by HuggingFace Space deployment
-
-3. **Generation**
-   - Takes user prompt
-   - Finds similar lyrics
-   - Uses context for GPT-4 generation
-   - Returns generated lyrics with sources
-
-## Environment Setup
-- Local development: Uses project-relative paths
-- HuggingFace deployment: Uses persistent storage paths
-- Controlled via `DEPLOYMENT_MODE` environment variable
-
-## Key Files
-- `config/settings.py`: Central configuration
-- `src/generator/generator.py`: Core generation logic
-- `scripts/process_lyrics.py`: Data processing pipeline
-- `scripts/test_embeddings.py`: Database validation
-
-## Development Notes
-- Always use "langchain" as ChromaDB collection name
-- Embeddings are versioned in HuggingFace dataset
-- Local testing available via test scripts
-- Comprehensive logging throughout pipeline
-
-## Deployment
-The system is deployed as:
-- HuggingFace Space: SongLift/LyrGen2
-- Database: SongLift/LyrGen2_DB
-
-## Testing
-```bash
-python scripts/test_embeddings.py
-python scripts/test_semantic.py
-```
-
-For troubleshooting and known issues, see TROUBLESHOOTING.md
-
-## Data Processing & Loading Workflow
-
-### 1. Lyrics Processing (`scripts/process_lyrics.py`)
-- Handles initial lyrics processing and embedding creation
-- Features:
-  - Batch processing with rate limiting
-  - Recursive text splitting (300 chars, 75 overlap)
-  - Automatic retry on API limits
-  - Comprehensive metadata tracking
-  - Progress monitoring with tqdm
-
-### 2. Data Loading (`src/utils/data_loader.py`)
-- Manages raw lyrics loading and cleaning
-- Features:
-  - Multi-encoding support (utf-8, latin-1, cp1252)
-  - Section marker detection ([Verse], [Chorus], etc.)
-  - Intelligent line breaking
-  - Metadata preservation (artist, song title)
-  - Robust validation checks
-
-### 3. Workflow Steps
-```bash
-# 1. Load and clean lyrics
-python scripts/process_lyrics.py
-# Creates cleaned, chunked documents with metadata
-
-# 2. Generate embeddings
-# (Handled automatically by process_lyrics.py)
-# Stores in data/processed/embeddings/chroma
-
-# 3. Upload to HuggingFace
-python scripts/upload_embeddings.py
-# Verifies and uploads to SongLift/LyrGen2_DB
-```
-
-### Data Flow
-```
-Raw Lyrics (txt)
-      ↓
-Data Loader (cleaning)
-      ↓
-Text Splitter (chunking)
-      ↓
-OpenAI Embeddings
-      ↓
-ChromaDB Storage
-      ↓
-HuggingFace Dataset
-```
-
-Each step includes validation and error handling to ensure data integrity. The process is designed to be resumable and maintains consistency between local and deployed environments.
docs/TODO.md
DELETED

@@ -1,43 +0,0 @@
-merge cleanstart2 to main
-remove TODO.txt from github repo
-
-
-Figure out which LLM to use instead of OpenAI
-Test kwargs, temperature, and max_tokens
-Figure out if it's already chunking lyrics or if I need to do that
-Figure out best chunk size and overlap (set in generator.py)
-Make it use existing vector store unless I change it
-Make sure Billboard #1s are in the lyrics
-Use lyrics_csv_to_txt workspace in Cursor. I used this to make the billboards.csv into separate files.
-
-Filter out the first line of each text file if it looks like this:
-ContributorsTranslationsDeutschEspañolPortuguêsFrançaisفارسیNederlands한국어DanskItalianoРусскийbreak
-
-
-Maybe add the artist name and genre to the lyric files so that the LLM can use that information to write better lyrics.
-
-Add temperature and top_p sliders in Streamlit
-
-TODO: Consider replacing OpenAI embeddings with local model
-- Pros:
-  - No API dependency for vector search
-  - Reduced costs
-  - Faster response times (no API latency)
-- Cons:
-  - Need to validate quality against current OpenAI embeddings
-  - May require more compute resources
-  - Need to ensure consistent vector space with existing embeddings
-
-Potential options:
-1. HuggingFace sentence-transformers
-2. all-MiniLM-L6-v2
-3. all-mpnet-base-v2
-
-Next steps:
-1. Research embedding model options
-2. Test quality against OpenAI embeddings
-3. Benchmark performance
-4. Plan migration strategy
docs/TROUBLESHOOTING.md
DELETED

@@ -1,44 +0,0 @@
-# Troubleshooting Guide
-
-## Embeddings Issues
-
-### Empty Chroma Collection (0 Documents)
-**Symptom:**
-- ChromaDB shows 0 documents despite large files being present
-- SQLite database shows records (e.g., embeddings: 233998 records)
-- Files exist and have expected sizes:
-  - chroma.sqlite3 (~576 MB)
-  - data_level0.bin (~1.3 GB)
-
-**Cause:**
-Collection name mismatch between processing and loading. The system uses two collections:
-- "langchain" (contains the data)
-- "lyrics" (empty)
-
-**Solution:**
-Always use "langchain" as the collection name in all operations:
-```python
-vector_store = Chroma(
-    persist_directory=str(chroma_dir),
-    embedding_function=embeddings,
-    collection_name="langchain"  # Must be "langchain"
-)
-```
-
-**Verification:**
-Run the test script to check collections:
-
-```bash
-python scripts/test_embeddings.py
-```
-
-Expected output:
-```
-Collection names: [Collection(name=langchain), Collection(name=lyrics)]
-Collection count: 233998  # For langchain collection
-```
-
-**Files to Check:**
-1. config/settings.py: CHROMA_COLLECTION_NAME
-2. src/generator/generator.py: vector_store initialization
-3. scripts/process_lyrics.py: Chroma.from_documents() call