Asalun committed
Commit f689d98 · verified · 1 Parent(s): 5e6bf85

Upload 4 files

Files changed (4)
  1. M1_UML_DR_Nomads.ipynb +0 -0
  2. README.md +1 -13
  3. requirements.txt +11 -0
  4. spotify_clusters_app.py +265 -0
M1_UML_DR_Nomads.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
README.md CHANGED
@@ -1,13 +1 @@
- ---
- title: USLAssignment
- emoji: 😻
- colorFrom: gray
- colorTo: yellow
- sdk: gradio
- sdk_version: 5.46.1
- app_file: app.py
- pinned: false
- license: mit
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # spotify
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ appdirs==1.4.4
+ argon2-cffi==21.1.0
+ asttokens==2.4.1
+ attrs==23.2.0
+ streamlit
+ pandas
+ numpy
+ matplotlib
+ seaborn
+ altair
+ scikit-learn
spotify_clusters_app.py ADDED
@@ -0,0 +1,265 @@
+ # --- Imports ---
+ import streamlit as st
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ import altair as alt
+ from sklearn.preprocessing import StandardScaler
+ from sklearn.decomposition import PCA
+ from sklearn.cluster import KMeans
+
+ # --- Page Config ---
+ st.set_page_config(
+     page_title="Spotify Song Clustering — Business Insights",
+     page_icon="🎵",
+     layout="wide",
+ )
+
+ # --- Title / Sidebar ---
+ st.title("🎵 Spotify Song Clustering — Business Insights Dashboard")
+ st.sidebar.header("Filters 📊")
+
+ # --- Data Loading & Caching (unchanged) ---
+ @st.cache_data
+ def load_raw_data():
+     url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'
+     return pd.read_csv(url)
+
+ df = load_raw_data()
+
+ # --- Business Problem Statement (unchanged, in the style of the first app) ---
+ st.markdown("""
+ #### Business Problem
+ Spotify aims to deliver smarter recommendations and more engaging playlists by understanding the *mood* and *context* of songs, not just their genre.
+ This dashboard uses **unsupervised learning** to reveal actionable clusters, helping Spotify personalize user experience, optimize curation, and unlock new business opportunities.
+ """)
+
+ with st.expander("📊 **Key Components of the Analysis**"):
+     st.markdown("""
+ - **Audio features**: `danceability`, `energy`, `loudness`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, `valence`.
+ - **Dimensionality reduction**: PCA for visualization (2D/3D) and better group separation.
+ - **Clustering**: KMeans to identify similar segments by “mood/context”.
+ - **Business view**: Cluster profiles with actionable recommendations.
+ """)
+
+ # --- Cleaning & Scaling (unchanged) ---
+ @st.cache_data
+ def clean_and_scale(songs):
+     drop_cols = [
+         "track_id", "track_name", "track_artist", "track_album_id",
+         "track_album_name", "track_album_release_date",
+         "playlist_name", "playlist_id", "tempo", "duration_ms"
+     ]
+     songs_clean = songs.drop(columns=drop_cols, errors="ignore")
+     features = [
+         "danceability", "energy", "loudness", "speechiness", "acousticness",
+         "instrumentalness", "liveness", "valence"
+     ]
+     X = songs_clean[features].dropna()
+     scaler = StandardScaler()
+     X_scaled = scaler.fit_transform(X)
+     return songs, songs_clean, X, X_scaled, features
+
+ songs = load_raw_data()
+ songs, songs_clean, X, X_scaled, features = clean_and_scale(songs)
+
+ # --- Sidebar Filters (in the style of the first app; safe against missing columns) ---
+ # Playlist genre filter
+ if "playlist_genre" in songs.columns:
+     all_genres = sorted(list(pd.Series(songs["playlist_genre"].dropna().unique()).astype(str)))
+ else:
+     all_genres = []
+ selected_genres = st.sidebar.multiselect("Select Playlist Genre 🎧", all_genres, default=all_genres if all_genres else None)
+
+ # Popularity range filter
+ if "track_popularity" in songs.columns:
+     pop_min, pop_max = int(songs["track_popularity"].min()), int(songs["track_popularity"].max())
+     pop_range = st.sidebar.slider("Track Popularity Range ⭐", pop_min, pop_max, (pop_min, pop_max))
+ else:
+     pop_range = None
+
+ # Apply filters to an auxiliary DataFrame (only for visuals that use 'songs')
+ songs_filtered = songs.copy()
+ if selected_genres:
+     songs_filtered = songs_filtered[songs_filtered["playlist_genre"].astype(str).isin(selected_genres)]
+ if pop_range:
+     songs_filtered = songs_filtered[(songs_filtered["track_popularity"] >= pop_range[0]) &
+                                     (songs_filtered["track_popularity"] <= pop_range[1])]
+
+ # --- Controls for PCA & KMeans (unchanged) ---
+ st.sidebar.header("🔎 Explore Clusters")
+ n_components = st.sidebar.slider("PCA Components (for visualization)", 2, len(features), 3)
+ k_clusters = st.sidebar.slider("Number of clusters (KMeans)", 2, 15, 10)
+
+ @st.cache_data
+ def run_pca(X_scaled, n_components):
+     pca = PCA(n_components=n_components)
+     return pca.fit_transform(X_scaled)
+
+ @st.cache_data
+ def run_kmeans(X_pca, k_clusters):
+     kmeans = KMeans(n_clusters=k_clusters, random_state=42, n_init=10)
+     return kmeans.fit_predict(X_pca)
+
+ X_pca = run_pca(X_scaled, n_components)
+ clusters = run_kmeans(X_pca, k_clusters)
+
+ songs_clustered = songs_clean.loc[X.index].copy()
+ songs_clustered["cluster"] = clusters
+
+ # --- Visualization Selector (as in the first app) ---
+ st.header("Analysis 📊")
+ visualization_option = st.selectbox(
+     "Select Visualization 🎨",
+     [
+         "2D PCA Scatter (by cluster)",
+         "3D PCA Scatter (by cluster)",
+         "Cluster Profiles — Average Audio Features (heatmap)",
+         "Feature Distributions by Cluster (boxplots)",
+         "Correlation Heatmap of Audio Features",
+         "Are clusters separable by popularity? (Altair scatter)"
+     ],
+ )
+
+ # --- Visualizations ---
+ if visualization_option == "2D PCA Scatter (by cluster)":
+     fig, ax = plt.subplots(figsize=(8, 5))
+     sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=clusters, palette="tab10", s=12, ax=ax, legend=False)
+     ax.set_xlabel("PC1")
+     ax.set_ylabel("PC2")
+     ax.set_title("2D PCA — Songs clustered")
+     st.pyplot(fig)
+
+ elif visualization_option == "3D PCA Scatter (by cluster)":
+     if n_components >= 3:
+         from mpl_toolkits.mplot3d import Axes3D  # noqa: F401
+         fig = plt.figure(figsize=(8, 6))
+         ax = fig.add_subplot(111, projection='3d')
+         sc = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=clusters, cmap='tab10', s=8)
+         ax.set_xlabel("PC1")
+         ax.set_ylabel("PC2")
+         ax.set_zlabel("PC3")
+         ax.set_title("3D PCA — Songs clustered")
+         st.pyplot(fig)
+     else:
+         st.info("Select at least 3 PCA components for 3D visualization.")
+
+ elif visualization_option == "Cluster Profiles — Average Audio Features (heatmap)":
+     st.subheader("Cluster Profiles — Average Audio Features")
+     cluster_profile = songs_clustered.groupby("cluster")[features].mean().round(2)
+     fig, ax = plt.subplots(figsize=(10, 5))
+     sns.heatmap(cluster_profile, annot=True, cmap="viridis", ax=ax)
+     ax.set_title("Average Feature Values per Cluster")
+     st.pyplot(fig)
+     st.dataframe(cluster_profile)
+
+ elif visualization_option == "Feature Distributions by Cluster (boxplots)":
+     st.subheader("Feature Distributions by Cluster")
+     selected_feature = st.selectbox("Select feature", features, index=0)
+     fig, ax = plt.subplots(figsize=(10, 5))
+     sns.boxplot(data=songs_clustered, x="cluster", y=selected_feature, ax=ax)
+     ax.set_title(f"Distribution of {selected_feature} by cluster")
+     st.pyplot(fig)
+
+ elif visualization_option == "Correlation Heatmap of Audio Features":
+     st.subheader("Correlation Heatmap")
+     corr = pd.DataFrame(X, columns=features).corr()
+     fig, ax = plt.subplots(figsize=(8, 6))
+     sns.heatmap(corr, annot=True, cmap="coolwarm", center=0, ax=ax)
+     ax.set_title("Correlation among audio features")
+     st.pyplot(fig)
+
+ elif visualization_option == "Are clusters separable by popularity? (Altair scatter)":
+     # Only used when track_popularity is available; maps it onto the clustered samples
+     if "track_popularity" in songs.columns:
+         # Align indices: bring popularity from 'songs' into 'songs_clustered'
+         pop_series = songs.loc[songs_clustered.index, "track_popularity"]
+         viz_df = songs_clustered.copy()
+         viz_df["track_popularity"] = pop_series
+         viz_df = viz_df.dropna(subset=["track_popularity"])
+         chart = alt.Chart(viz_df.reset_index(drop=True)).mark_point(filled=True).encode(
+             alt.X('track_popularity:Q', title='Track popularity'),
+             alt.Y('valence:Q', title='Valence'),
+             alt.Color('cluster:N'),
+             alt.OpacityValue(0.7),
+             tooltip=['cluster:N'] + [c for c in ["valence", "energy", "danceability"] if c in viz_df.columns]
+         ).properties(height=450)
+         st.altair_chart(chart, use_container_width=True)
+     else:
+         st.warning("`track_popularity` not available in this dataset.")
+
+ # --- Cluster Insights (unchanged) ---
+ st.sidebar.markdown("---")
+ selected_cluster = st.sidebar.selectbox("Select cluster for details", sorted(songs_clustered["cluster"].unique()))
+
+ cluster_business = {
+     0: "Acoustic / Chill 🌿: Calm, relaxing. Use in study/wellness playlists.",
+     1: "Classic Rock 🎸: Guitar-driven, nostalgic. Promote with live events.",
+     2: "EDM / Dance 🎧: High-energy. Add to workout/party playlists.",
+     3: "Electropop / Dance Pop 🔥: Catchy, upbeat. Viral playlists, social media.",
+     4: "Hard Rock / Metal 🤘: Loud, intense. Niche playlists, festival tie-ins.",
+     5: "Indie / Alternative 🌌: Creative, experimental. Discovery playlists.",
+     6: "Latin / Reggaeton 🌴: Rhythmic, upbeat. Geo-targeted playlists.",
+     7: "Pop Mainstream 🎶: Balanced, mass-market. Chart-topping hits.",
+     8: "R&B / Soul 💜: Smooth, emotional. Romance/mood playlists.",
+     9: "Rap / Trap 🎤: Speech-heavy, beat-driven. Youth/urban playlists."
+ }
+
+ cluster_actions = {
+     0: "Focus/study playlists, wellness app partnerships.",
+     1: "Live concert tie-ins, nostalgic campaigns.",
+     2: "Workout/party playlists, fitness brand collaborations.",
+     3: "Viral playlist promotion, social media campaigns.",
+     4: "Niche playlist curation, festival partnerships.",
+     5: "Discovery playlists, support for emerging artists.",
+     6: "Geo-targeted playlists, dance event promotions.",
+     7: "Algorithmic playlist anchors, sponsored content.",
+     8: "Romance/mood playlists, lifestyle brand partnerships.",
+     9: "Youth/urban playlists, influencer collaborations."
+ }
+
+ st.markdown(f"### Cluster {selected_cluster} — Business Insights")
+ st.markdown(f"**Business Description:** {cluster_business.get(selected_cluster, 'Segmented by mood/context.')}")
+ st.markdown(f"**Actionable Recommendation:** {cluster_actions.get(selected_cluster, 'Curate and promote according to cluster characteristics.')}")
+ st.markdown("**Sample Songs in this Cluster:**")
+ sample_cols = [c for c in ["track_name", "track_artist", "playlist_genre"] if c in songs.columns]
+ st.write(songs.loc[songs_clustered[songs_clustered['cluster'] == selected_cluster].index, sample_cols].head(10))
+
+ # --- Dataset Overview (in the style of the first app) ---
+ st.header("Dataset Overview")
+ st.dataframe(songs.describe(include='all').transpose())
+
+ # --- Insights Expander (in the style of the first app) ---
+ with st.expander("Interpreting the visualizations"):
+     st.markdown("""
+ 1. **PCA scatter**: clusters tend to occupy distinct regions, suggesting different *moods* (e.g., high `energy` + low `acousticness` close together).
+ 2. **Profile heatmap**: per-cluster means show clear contrasts (e.g., high `valence`/`danceability` for pop/dance).
+ 3. **Boxplots by cluster**: show spread and outliers per feature, useful for tuning K or the feature set.
+ 4. **Correlation**: `energy` and `loudness` tend to be correlated; watch for scale leakage.
+ 5. **Popularity vs. valence**: clusters with higher `valence`/`danceability` may show higher popularity (a marketing insight).
+ """)
+
+ # --- Rationale & Strategic Insights (unchanged) ---
+ st.header("Rationale & Strategic Insights")
+ st.markdown("""
+ ### Why This Approach?
+ - **Business Need:** Genre-based recommendations miss nuances of mood/context. Clustering on audio features reveals actionable segments.
+ - **Data Cleaning:** Removal of non-audio columns and incomplete rows.
+ - **Feature Selection:** Focus on the 8 attributes most closely tied to “mood/context”.
+ - **Scaling:** `StandardScaler` balances feature contributions.
+ - **PCA:** Visualization and clearer separation.
+ - **KMeans:** Interpretable groups for action.
+
+ ### Strategic Insights for Stakeholders
+ - **Personalization:** Contextual recommendations increase relevance and satisfaction.
+ - **Playlist Curation:** Automatic assignment of new tracks to specific clusters.
+ - **Marketing & Engagement:** Cluster-themed campaigns and playlists.
+ - **Artist Discovery:** Emerging trends by segment.
+ - **Partnerships & Revenue:** Partnerships aligned with *mood* (wellness, fitness, etc.).
+ - **Continuous Improvement:** Measure performance per cluster → iterate on K, features, and rules.
+ """)
+
+ # --- Footer ---
+ st.markdown("---")
+ st.markdown("© 2025 Spotify Clustering Assignment — Powered by Streamlit")
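
Review note: the scale, PCA, KMeans order used in spotify_clusters_app.py can be tried outside Streamlit with a minimal sketch like the one below. It is not part of the commit. The helper name `cluster_features` and the synthetic `X_demo` matrix are hypothetical stand-ins for the eight audio-feature columns, chosen so the example runs without downloading the dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_features(X, n_components=3, k_clusters=10, random_state=42):
    """Scale -> PCA -> KMeans, in the same order as the app (hypothetical helper)."""
    # Standardize each feature to zero mean and unit variance.
    X_scaled = StandardScaler().fit_transform(X)
    # Project onto the leading principal components.
    pca = PCA(n_components=n_components, random_state=random_state)
    X_pca = pca.fit_transform(X_scaled)
    # Cluster in the reduced space, as run_kmeans(X_pca, ...) does in the app.
    kmeans = KMeans(n_clusters=k_clusters, random_state=random_state, n_init=10)
    labels = kmeans.fit_predict(X_pca)
    return X_pca, labels, pca.explained_variance_ratio_


# Synthetic stand-in for the 8 audio features (made-up data, not the Spotify set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 8))

X_pca, labels, evr = cluster_features(X_demo, n_components=3, k_clusters=4)
print(X_pca.shape, np.unique(labels).size, float(evr.sum()))
```

Because KMeans is fitted on the PCA output rather than on the scaled features, the "PCA Components (for visualization)" slider in the app also changes the cluster assignments, not only the plots. Printing the summed `explained_variance_ratio_` is one way to see how much variance the chosen number of components retains before clustering.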