James Edmunds committed on
Commit af1818d · 1 Parent(s): 40c5feb

Security improvements: Enhanced README, added .env.example, removed docs from tracking

Files changed (7)
  1. .env.example +11 -0
  2. .gitignore +3 -0
  3. README.md +220 -4
  4. docs/NOTES.md +0 -102
  5. docs/PROJECT_README.md +0 -125
  6. docs/TODO.md +0 -43
  7. docs/TROUBLESHOOTING.md +0 -44
.env.example ADDED
@@ -0,0 +1,11 @@
+ # OpenAI API Key
+ # Get your key from: https://platform.openai.com/api-keys
+ OPENAI_API_KEY=your_openai_api_key_here
+
+ # HuggingFace Token (optional, for dataset access)
+ # Get your token from: https://huggingface.co/settings/tokens
+ HF_TOKEN=your_huggingface_token_here
+
+ # Deployment Mode
+ # 'local' for development, 'huggingface' for HF Space
+ DEPLOYMENT_MODE=local
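The template above is typically loaded at startup with a helper such as python-dotenv. As a dependency-free illustration (the function name and parsing rules here are a sketch, not part of this repo), a minimal `.env` reader looks like:

```python
import tempfile

def load_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file, skipping blanks and '#' comments."""
    values = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # ignore comments and blank lines
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

# Parse a file shaped like the template added in this commit
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as fh:
    fh.write("# OpenAI API Key\nOPENAI_API_KEY=your_openai_api_key_here\n\nDEPLOYMENT_MODE=local\n")
    env_path = fh.name

env = load_env_file(env_path)
print(env["DEPLOYMENT_MODE"])  # local
```

Real deployments should prefer python-dotenv, which also handles quoting and variable expansion.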
.gitignore CHANGED
@@ -44,6 +44,9 @@ htmlcov/
  .env.local
  TODO.txt
 
+ # Documentation (keep private)
+ docs/
+
  # Huggingface
  .hf/
  .huggingface/
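The new `docs/` rule untracks everything under that directory. Real gitignore semantics are much richer (negation, globs, anchoring), but the trailing-slash behaviour used here can be sketched in a few lines (all names illustrative):

```python
def is_ignored(path, patterns):
    """Very simplified gitignore check: 'dir/' patterns match anything under that directory."""
    for pat in patterns:
        if pat.endswith("/") and (path.startswith(pat) or path == pat.rstrip("/")):
            return True  # directory pattern: match the dir and its contents
        if path == pat:
            return True  # exact file match
    return False

patterns = [".env.local", "docs/", ".hf/", ".huggingface/"]
print(is_ignored("docs/NOTES.md", patterns))  # True
print(is_ignored("README.md", patterns))      # False
```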
README.md CHANGED
@@ -9,10 +9,226 @@ app_file: app.py
  pinned: false
  ---
 
- # SongLift LyrGen2
-
- An AI-powered lyric generator using LangChain and RAG (Retrieval-Augmented Generation) technology. This application leverages OpenAI's models and a database of existing lyrics to generate new, contextually-aware song lyrics with modern pop and hip-hop sensibilities.
-
- ## Browser Compatibility
- ⚠️ For best results, use Chrome or another Chromium-based browser. Some features may not work correctly in Safari.
+ # SongLift LyrGen2 🎵
+
+ An AI-powered lyrics generation system that uses semantic understanding of existing lyrics to generate new, contextually relevant song lyrics. Built with LangChain, RAG (Retrieval-Augmented Generation), and OpenAI's GPT-4.
+
+ ## 🚀 Live Demo
+
+ **[Try it on HuggingFace Spaces](https://huggingface.co/spaces/SongLift/LyrGen2)**
+
+ ## ✨ Features
+
+ - **Semantic Lyrics Generation**: Uses vector embeddings of 234K+ lyrics for contextual understanding
+ - **RAG Technology**: Retrieval-Augmented Generation finds similar lyrics to inform new creations
+ - **Modern Sensibilities**: Retrieval corpus drawn from contemporary pop and hip-hop lyrics
+ - **Interactive Web Interface**: Clean Streamlit interface for easy use
+ - **Source Attribution**: Shows which lyrics influenced the generation
+
+ ## 🏗️ Architecture
+
+ ### Core Components
+ - **Vector Database**: ChromaDB with OpenAI Ada-002 embeddings
+ - **AI Models**: GPT-4 for generation, Ada-002 for embeddings
+ - **Data Pipeline**: Automated processing of raw lyrics into searchable embeddings
+ - **Dual Deployment**: Local development + HuggingFace Spaces production
+
+ ### Workflow
+ ```
+ Raw Lyrics → Data Cleaning → Text Chunking → Embeddings → ChromaDB → Generation
+ ```
+
+ ## 🛠️ Local Development
+
+ ### Prerequisites
+ - Python 3.8+
+ - OpenAI API key
+ - HuggingFace token (optional, for dataset access)
+
+ ### Setup
+ ```bash
+ # Clone the repository
+ git clone <your-repo-url>
+ cd SongLift_LyrGen2
+
+ # Create virtual environment
+ python -m venv .venv
+ source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Configure environment
+ cp .env.example .env
+ # Edit .env with your API keys
+ ```
+
+ ### Environment Variables
+ Create a `.env` file with:
+ ```env
+ OPENAI_API_KEY=your_openai_api_key_here
+ HF_TOKEN=your_huggingface_token_here
+ DEPLOYMENT_MODE=local
+ ```
+
+ ### Run Locally
+ ```bash
+ streamlit run app.py
+ ```
+ Visit `http://localhost:8501`
+
+ ## 🧪 Testing & Validation
+
+ ```bash
+ # Test your environment setup
+ python scripts/test_environment.py
+
+ # Test OpenAI connection
+ python scripts/test_openai_connection.py
+
+ # Validate embeddings database
+ python scripts/test_embeddings.py
+ ```
+
+ ## 📊 Data Processing
+
+ The system processes lyrics through a sophisticated pipeline:
+
+ 1. **Raw Data Loading** (`scripts/process_lyrics.py`)
+    - Multi-encoding support (UTF-8, Latin-1, CP1252)
+    - Section detection ([Verse], [Chorus], etc.)
+    - Metadata preservation
+
+ 2. **Text Processing**
+    - Recursive text splitting (300 chars, 75 overlap)
+    - Batch processing with rate limiting
+    - Automatic retry on API limits
+
+ 3. **Vector Storage**
+    - ChromaDB collection: "lyrics_v1"
+    - ~234K embedded documents
+    - Metadata tracking (artist, song title)
+
+ ## 🚀 Deployment
+
+ ### HuggingFace Spaces
+ The app auto-deploys to HuggingFace Spaces via GitHub sync:
+ - **Space**: [SongLift/LyrGen2](https://huggingface.co/spaces/SongLift/LyrGen2)
+ - **Dataset**: [SongLift/LyrGen2_DB](https://huggingface.co/datasets/SongLift/LyrGen2_DB)
+
+ Configure secrets in HF Spaces settings:
+ - `OPENAI_API_KEY`
+ - `HF_TOKEN`
+
+ ### Local to Production Sync
+ ```bash
+ # Process and upload embeddings
+ python scripts/process_lyrics.py
+ python scripts/upload_embeddings.py
+ ```
+
+ ## 🔧 Configuration
+
+ Key configuration in `config/settings.py`:
+ - **Models**: GPT-4 for generation, Ada-002 for embeddings
+ - **Paths**: Auto-detects local vs HuggingFace environment
+ - **Database**: ChromaDB with persistent storage
+
+ ## 📁 Project Structure
+
+ ```
+ SongLift_LyrGen2/
+ ├── app.py                 # Main Streamlit application
+ ├── config/
+ │   └── settings.py        # Central configuration
+ ├── src/
+ │   ├── generator/         # Core generation logic
+ │   └── utils/             # Utility functions
+ ├── scripts/               # Data processing & testing
+ ├── data/
+ │   ├── raw/lyrics/        # Original lyrics files
+ │   └── processed/         # Embeddings & processed data
+ └── docs/                  # Documentation
+ ```
+
+ ## 🔍 Browser Compatibility
+ ⚠️ **Recommended**: Chrome or Chromium-based browsers for optimal performance. Some features may not work correctly in Safari.
+
+ ## 🤗 HuggingFace Spaces Setup
+
+ ### Deploy Your Own Space
+
+ 1. **Create a HuggingFace Space**:
+    - Go to [HuggingFace Spaces](https://huggingface.co/spaces)
+    - Click "Create new Space"
+    - Choose "Streamlit" as SDK
+    - Set `app_file: app.py`
+
+ 2. **Configure Secrets**:
+    - In your Space settings, add these secrets:
+      - `OPENAI_API_KEY`: Your OpenAI API key
+      - `HF_TOKEN`: Your HuggingFace token (for dataset access)
+
+ 3. **Upload Your Dataset**:
+    ```bash
+    # Process and upload embeddings to HF dataset
+    python scripts/process_lyrics.py
+    python scripts/upload_embeddings.py
+    ```
+
+ 4. **Sync with GitHub** (optional):
+    - Connect your Space to a GitHub repo for automatic deployments
+    - Push changes to GitHub → auto-deploys to HF Spaces
+
+ ### Running HuggingFace Locally
+
+ You can test the HuggingFace environment locally:
+
+ ```bash
+ # Set HuggingFace mode
+ export DEPLOYMENT_MODE=huggingface
+
+ # Run locally (will use HF dataset paths)
+ streamlit run app.py
+ ```
+
+ This helps debug HF-specific issues before deploying.
+
+ ## 🤝 Contributing
+
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Make your changes
+ 4. Add tests if applicable
+ 5. Submit a pull request
+
+ ## 📄 License
+
+ MIT License
+
+ Copyright (c) 2024 SongLift
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
+ ## 🙏 Acknowledgments
+
+ - Built with [LangChain](https://langchain.com/) and [Streamlit](https://streamlit.io/)
+ - Powered by [OpenAI](https://openai.com/) and [HuggingFace](https://huggingface.co/)
+ - Vector storage by [ChromaDB](https://www.trychroma.com/)
 
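The retrieval step in the README's workflow (embed the prompt, find the nearest lyric chunks in ChromaDB) reduces to nearest-neighbour search by cosine similarity. A toy sketch with hand-made three-dimensional vectors (the real system uses Ada-002 embeddings and ChromaDB; everything below is illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, docs, k=2):
    """Return the texts of the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

docs = [
    ("chorus about summer nights", [0.9, 0.1, 0.0]),
    ("verse about heartbreak",     [0.0, 1.0, 0.2]),
    ("hook about city lights",     [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], docs))  # the two "nights/lights" chunks rank first
```

The retrieved texts are then packed into the GPT-4 prompt as context, which is also what enables the README's "Source Attribution" feature.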
docs/NOTES.md DELETED
@@ -1,102 +0,0 @@
-
- Notes:
-
- I could fine tune an openai model and access it via langchain if i wanted to:
-
- https://python.langchain.com/docs/integrations/chat/openai/
-
- This generally takes the form of ft:{OPENAI_MODEL_NAME}:{ORG_NAME}::{MODEL_ID}. For example:
-
- fine_tuned_model = ChatOpenAI(
-     temperature=0, model_name="ft:gpt-3.5-turbo-0613:langchain::7qTVM5AR"
- )
-
- fine_tuned_model.invoke(messages)
-
- IMPORTANT SECTIONS:
- generator.py:
- def setup_qa_chain(self):
-     """Initialize the QA chain for generating lyrics"""
-     retriever = self.vector_store.as_retriever(
-         search_kwargs={
-             "k": 500,  # Increased from the default of 3
-             "fetch_k": 5000,  # Fetch more candidates before selecting top k
-             "score_threshold": 0.7  # Only use relevant documents
-         }
-
- MODEL kwarg meanings:
- Default Penalties
- Default values are 0.0 for both presence and frequency penalties. This means:
- No additional penalty for repeated tokens
- No additional penalty for using common tokens
-
- Effects:
- Higher temperature: More creative/random → better for unique lyrics
- Higher top_p: More vocabulary variety → richer language
- Higher presence/frequency penalties: Less repetition → more unique phrases
-
- Parameter Deep Dive
- Temperature (0.0 - 2.0, default 1.0)
- Controls randomness in token selection
- Lower = more deterministic, focused, and conservative
- Higher = more creative, random, and potentially chaotic
- Think of it as "creativity vs consistency"
- Top_p (0.0 - 1.0, default 1.0)
- Controls diversity by limiting cumulative probability of next tokens
- Similar to temperature but more precise
- Lower values = more focused vocabulary
- Higher values = more diverse word choices
- Often preferred over temperature for creative tasks
- Presence Penalty (-2.0 to 2.0)
- Penalizes tokens that have appeared at all
- Higher values encourage using new topics/concepts
- Good for avoiding repetitive themes
- Different from frequency penalty because it penalizes any reuse
- Frequency Penalty (-2.0 to 2.0)
- Penalizes tokens based on how often they've appeared
- Higher values discourage frequent word reuse
- Good for avoiding stuck-in-a-loop repetition
- More granular than presence penalty
-
- # Using Temperature (controls randomness)
- llm = ChatOpenAI(
-     temperature=0.7,  # More focused than the default 1.0
-     top_p=1.0  # Default, effectively disabled
- )
-
- # Using Top_p (nucleus sampling)
- llm = ChatOpenAI(
-     temperature=1.0,  # Default
-     top_p=0.7  # More focused vocabulary
- )
-
- Chunk Size and Overlap
- Small Chunks (100 characters):
- Effects:
- More granular retrieval
- Better for finding specific lines or phrases
- Might lose broader context/themes
- Could fragment verses/concepts
- More chunks to process = potentially slower
- Overlap Effects:
- Example of Different Chunk Sizes:
- Overlap Examples:
- The key is finding the right balance for your specific use case. For lyrics, I'd recommend:
- Chunk size: 200-300 characters (typical verse size)
- Overlap: 50-75 characters (enough to maintain rhyme patterns and flow)
- This maintains musical phrases while allowing for good retrieval granularity
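The chunk-size/overlap trade-off discussed in these (now removed) notes can be made concrete with a character-window splitter. This is a simplification of LangChain's recursive splitter, which additionally tries to break on separators; the function name and test text are illustrative:

```python
def split_text(text, chunk_size=300, overlap=75):
    """Slide a fixed-size window over the text, repeating `overlap` chars between chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final (possibly short) chunk emitted
        start += step
    return chunks

lyrics = "la " * 400  # 1200 characters of stand-in lyrics
chunks = split_text(lyrics)
print(len(chunks), len(chunks[0]))
```

With 300/75 each consecutive pair of chunks shares 75 characters, which is what preserves rhyme patterns across chunk boundaries during retrieval.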
docs/PROJECT_README.md DELETED
@@ -1,125 +0,0 @@
- # SongLift LyrGen2 - AI Lyrics Generation System
-
- ## Project Overview
- SongLift LyrGen2 is an AI-powered lyrics generation system that uses semantic understanding of existing lyrics to generate new, contextually relevant song lyrics. The system combines vector embeddings of existing lyrics with OpenAI's GPT-4 for creative generation.
-
- ## Core Components
-
- ### 1. Data Processing Pipeline
- - Raw lyrics stored in `data/raw/lyrics`
- - Processed into embeddings using ChromaDB
- - Embeddings stored in `data/processed/embeddings/chroma`
-
- ### 2. Vector Database
- - Uses ChromaDB for vector storage
- - Collection name: "langchain"
- - Stores embeddings with metadata (artist, song title)
- - Approximately 234K embedded documents
-
- ### 3. AI Models
- - Embeddings: OpenAI Ada-002
- - Generation: GPT-4
- - Vector similarity search for context
-
- ## Workflow
-
- 1. **Data Preparation**
-    ```bash
-    python scripts/process_lyrics.py
-    ```
-    - Loads raw lyrics
-    - Splits into chunks
-    - Creates embeddings
-    - Stores in ChromaDB
-
- 2. **Deployment**
-    ```bash
-    python scripts/upload_embeddings.py
-    ```
-    - Uploads embeddings to HuggingFace dataset
-    - Used by HuggingFace Space deployment
-
- 3. **Generation**
-    - Takes user prompt
-    - Finds similar lyrics
-    - Uses context for GPT-4 generation
-    - Returns generated lyrics with sources
-
- ## Environment Setup
- - Local development: Uses project-relative paths
- - HuggingFace deployment: Uses persistent storage paths
- - Controlled via `DEPLOYMENT_MODE` environment variable
-
- ## Key Files
- - `config/settings.py`: Central configuration
- - `src/generator/generator.py`: Core generation logic
- - `scripts/process_lyrics.py`: Data processing pipeline
- - `scripts/test_embeddings.py`: Database validation
-
- ## Development Notes
- - Always use "langchain" as ChromaDB collection name
- - Embeddings are versioned in HuggingFace dataset
- - Local testing available via test scripts
- - Comprehensive logging throughout pipeline
-
- ## Deployment
- The system is deployed as:
- - HuggingFace Space: SongLift/LyrGen2
- - Database: SongLift/LyrGen2_DB
-
- ## Testing
- ```bash
- python scripts/test_embeddings.py
- python scripts/test_semantic.py
- ```
-
- For troubleshooting and known issues, see TROUBLESHOOTING.md
-
- ## Data Processing & Loading Workflow
-
- ### 1. Lyrics Processing (`scripts/process_lyrics.py`)
- - Handles initial lyrics processing and embedding creation
- - Features:
-   - Batch processing with rate limiting
-   - Recursive text splitting (300 chars, 75 overlap)
-   - Automatic retry on API limits
-   - Comprehensive metadata tracking
-   - Progress monitoring with tqdm
-
- ### 2. Data Loading (`src/utils/data_loader.py`)
- - Manages raw lyrics loading and cleaning
- - Features:
-   - Multi-encoding support (utf-8, latin-1, cp1252)
-   - Section marker detection ([Verse], [Chorus], etc.)
-   - Intelligent line breaking
-   - Metadata preservation (artist, song title)
-   - Robust validation checks
-
- ### 3. Workflow Steps
- ```bash
- # 1. Load and clean lyrics
- python scripts/process_lyrics.py
- # Creates cleaned, chunked documents with metadata
-
- # 2. Generate embeddings
- # (Handled automatically by process_lyrics.py)
- # Stores in data/processed/embeddings/chroma
-
- # 3. Upload to HuggingFace
- python scripts/upload_embeddings.py
- # Verifies and uploads to SongLift/LyrGen2_DB
- ```
-
- ### Data Flow
- ```
- Raw Lyrics (txt)
- ↓
- Data Loader (cleaning)
- ↓
- Text Splitter (chunking)
- ↓
- OpenAI Embeddings
- ↓
- ChromaDB Storage
- ↓
- HuggingFace Dataset
- ```
-
- Each step includes validation and error handling to ensure data integrity. The process is designed to be resumable and maintains consistency between local and deployed environments.
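The `DEPLOYMENT_MODE` switch described in the removed file amounts to a small path resolver. A sketch, in which the HuggingFace persistent-storage path is a placeholder rather than the repo's actual value:

```python
import os
from pathlib import Path

def resolve_chroma_dir(project_root, env=os.environ):
    """Choose the embeddings directory from DEPLOYMENT_MODE ('local' is the default)."""
    if env.get("DEPLOYMENT_MODE", "local") == "huggingface":
        return Path("/data/embeddings/chroma")  # placeholder persistent-storage path
    return Path(project_root) / "data" / "processed" / "embeddings" / "chroma"

print(resolve_chroma_dir("/repo", {"DEPLOYMENT_MODE": "local"}))
print(resolve_chroma_dir("/repo", {"DEPLOYMENT_MODE": "huggingface"}))
```

Centralizing this choice in one function (as `config/settings.py` apparently does) keeps every script agnostic about where it is running.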
docs/TODO.md DELETED
@@ -1,43 +0,0 @@
- merge cleanstart2 to main
- remove TODO.txt from github repo
-
- Figure out which LLM to use instead of OpenAI
- Test kwargs, temperature, and max_tokens
- Figure out if it's already chunking lyrics or if I need to do that
- Figure out best chunk size and overlap (set in generator.py)
- Make it use existing vector store unless I change it
- Make sure Billboard #1s are in the lyrics
- Use lyrics_csv_to_txt workspace in cursor. I used this to make the billboards.csv into separate files.
-
- Filter out the first line of each text file if it looks like this:
- ContributorsTranslationsDeutschEspañolPortuguêsFrançaisفارسیNederlands한국어DanskItalianoРусскийbreak
-
- Maybe add the artist name and genre to the lyric files so that the LLM can use that information to write better lyrics.
-
- Add temperature and top_p sliders in streamlit
-
- TODO: Consider replacing OpenAI embeddings with local model
- Pros:
-   - No API dependency for vector search
-   - Reduced costs
-   - Faster response times (no API latency)
- Cons:
-   - Need to validate quality against current OpenAI embeddings
-   - May require more compute resources
-   - Need to ensure consistent vector space with existing embeddings
-
- Potential options:
- 1. HuggingFace sentence-transformers
- 2. all-MiniLM-L6-v2
- 3. all-mpnet-base-v2
-
- Next steps:
- 1. Research embedding model options
- 2. Test quality against OpenAI embeddings
- 3. Benchmark performance
- 4. Plan migration strategy
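The "filter out the first line" item in the removed TODO is a one-line cleanup of scraped lyric files. A sketch (the function name is illustrative, and the header text here is abbreviated):

```python
def strip_contributor_header(lines):
    """Drop a leading Genius-style 'Contributors...' metadata line if present."""
    if lines and lines[0].startswith("Contributors"):
        return lines[1:]
    return lines

raw = ["ContributorsTranslationsDeutschEspañol", "First real lyric line", "Second line"]
print(strip_contributor_header(raw)[0])  # First real lyric line
```

Running this before chunking keeps site boilerplate out of the embedding corpus, where it would otherwise pollute similarity search.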
docs/TROUBLESHOOTING.md DELETED
@@ -1,44 +0,0 @@
- # Troubleshooting Guide
-
- ## Embeddings Issues
-
- ### Empty Chroma Collection (0 Documents)
- **Symptom:**
- - ChromaDB shows 0 documents despite large files being present
- - SQLite database shows records (e.g., embeddings: 233998 records)
- - Files exist and have expected sizes:
-   - chroma.sqlite3 (~576 MB)
-   - data_level0.bin (~1.3 GB)
-
- **Cause:**
- Collection name mismatch between processing and loading. The system uses two collections:
- - "langchain" (contains the data)
- - "lyrics" (empty)
-
- **Solution:**
- Always use "langchain" as the collection name in all operations:
- ```python
- vector_store = Chroma(
-     persist_directory=str(chroma_dir),
-     embedding_function=embeddings,
-     collection_name="langchain"  # Must be "langchain"
- )
- ```
-
- **Verification:**
- Run the test script to check collections:
-
- ```bash
- python scripts/test_embeddings.py
- ```
-
- Expected output:
- ```
- Collection names: [Collection(name=langchain), Collection(name=lyrics)]
- Collection count: 233998  # For langchain collection
- ```
-
- **Files to Check:**
- 1. config/settings.py: CHROMA_COLLECTION_NAME
- 2. src/generator/generator.py: vector_store initialization
- 3. scripts/process_lyrics.py: Chroma.from_documents() call
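A cheap way to fail fast on the collection-name mismatch described in the removed guide is to validate the configured name against what the store actually contains before wiring up a retriever. A sketch, with a plain list standing in for ChromaDB's client API (function name and error message are illustrative):

```python
def require_collection(available, expected="langchain"):
    """Raise early if the expected collection is missing, naming the likely mismatch."""
    if expected not in available:
        raise ValueError(
            f"Collection {expected!r} not found; available: {available}. "
            "Check CHROMA_COLLECTION_NAME in config/settings.py."
        )
    return expected

print(require_collection(["langchain", "lyrics"]))  # langchain
```

An explicit check like this turns the silent "0 documents" symptom into an immediate, self-explanatory error at startup.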