Description Clustering Script
This script provides comprehensive text clustering capabilities for analyzing descriptions from CSV files. It supports multiple clustering algorithms and provides detailed analysis and visualization.
Features
- Multiple Clustering Methods: K-means, Hierarchical clustering, LDA topic modeling, and NMF topic modeling
- Text Preprocessing: Tokenization, stopword removal, lemmatization
- TF-IDF Vectorization: Converts text to numerical features
- Clustering Evaluation: Silhouette score and Calinski-Harabasz score
- Keyword Extraction: Identifies top keywords for each cluster
- Visualization: t-SNE plots for cluster visualization
- Sample Data: Built-in sample data for testing and demonstration
Installation
- Install the required dependencies:
pip install -r clustering_requirements.txt
- The script will automatically download required NLTK data on first run.
Usage
Command Line Interface
The main script can be used from the command line:
# Create sample data
python description_clustering.py --create-sample
# Cluster with default settings (K-means, 5 clusters)
python description_clustering.py --input sample_descriptions.csv --method kmeans --clusters 5
# Use hierarchical clustering with visualization
python description_clustering.py --input data.csv --method hierarchical --clusters 8 --visualize
# Topic modeling with LDA
python description_clustering.py --input data.csv --method lda --clusters 6 --max-features 2000
# Specify custom column name
python description_clustering.py --input data.csv --column "product_description" --method kmeans --clusters 10
Command Line Options
- --input: Input CSV file path
- --column: Column name containing descriptions (default: 'description')
- --method: Clustering method ('kmeans', 'hierarchical', 'lda', 'nmf')
- --clusters: Number of clusters/topics
- --max-features: Maximum features for TF-IDF vectorization
- --output: Output JSON file for results
- --visualize: Generate t-SNE visualization
- --create-sample: Create sample data for testing
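As a rough sketch, the options above could be wired up with Python's argparse module; the flag names follow the list above, but the defaults and parser details here are assumptions, not the script's actual code:

```python
import argparse

def build_parser():
    # Mirrors the CLI options listed above; defaults are illustrative.
    p = argparse.ArgumentParser(description="Cluster text descriptions from a CSV file")
    p.add_argument("--input", help="Input CSV file path")
    p.add_argument("--column", default="description", help="Column containing descriptions")
    p.add_argument("--method", default="kmeans",
                   choices=["kmeans", "hierarchical", "lda", "nmf"])
    p.add_argument("--clusters", type=int, default=5, help="Number of clusters/topics")
    p.add_argument("--max-features", type=int, default=1000, help="Max TF-IDF features")
    p.add_argument("--output", help="Output JSON file for results")
    p.add_argument("--visualize", action="store_true", help="Generate t-SNE plot")
    p.add_argument("--create-sample", action="store_true", help="Create sample data")
    return p

args = build_parser().parse_args(["--input", "data.csv", "--method", "lda", "--clusters", "6"])
print(args.method, args.clusters)
```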
Python API
You can also use the clustering functionality programmatically:
from description_clustering import DescriptionClusterer
import pandas as pd
# Load your data
df = pd.read_csv('your_data.csv')
# Initialize clusterer
clusterer = DescriptionClusterer()
# Preprocess and vectorize
descriptions = df['description'].tolist()
processed_descriptions = clusterer.preprocess_text(descriptions)
embeddings = clusterer.vectorize_text(processed_descriptions)
# Perform clustering
cluster_labels = clusterer.kmeans_clustering(embeddings, n_clusters=5)
# Evaluate results
evaluation_scores = clusterer.evaluate_clustering(embeddings)
cluster_keywords = clusterer.get_cluster_keywords()
# Visualize
clusterer.visualize_clusters(embeddings)
Example Usage
Running the Example Script
python example_clustering.py
This will:
- Create sample data with 100 descriptions across 10 categories
- Run clustering with 3 different methods (K-means, Hierarchical, LDA)
- Compare results and save outputs
- Provide detailed analysis of clusters
Sample Output
=== Description Clustering Example ===
1. Creating sample data...
Created 100 sample descriptions
2. Initializing clusterer...
3. Preprocessing text...
Preprocessed 100 descriptions
4. Vectorizing text...
Created embeddings with shape: (100, 500)
5. Performing KMEANS clustering...
Silhouette Score: 0.2345
Calinski-Harabasz Score: 45.67
6. Clustering Results Summary:
==================================================
KMEANS Clustering:
Silhouette Score: 0.2345
Calinski-Harabasz Score: 45.67
Cluster Distribution:
Cluster 0: 12 descriptions
Cluster 1: 8 descriptions
...
Top Keywords by Cluster:
Cluster 0: restaurant, food, dining, service, cuisine
Cluster 1: technology, software, system, platform, solution
...
Output Files
The script generates several output files:
- clustering_results.json: Detailed clustering results including:
  - Evaluation scores
  - Cluster keywords
  - Cluster distribution
  - Sample descriptions per cluster
- clustered_[input_file].csv: Original data with cluster labels added
- cluster_visualization.png: t-SNE visualization (if --visualize flag used)
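The JSON results can also be consumed programmatically. The structure below is only a sketch based on the field list above; the script's exact key names may differ:

```python
import json

# Hypothetical results file matching the fields described above.
results = {
    "evaluation": {"silhouette_score": 0.23, "calinski_harabasz_score": 45.7},
    "cluster_keywords": {"0": ["restaurant", "food"], "1": ["software", "platform"]},
    "cluster_distribution": {"0": 12, "1": 8},
}
with open("clustering_results.json", "w") as f:
    json.dump(results, f, indent=2)

# Load the results back and print a per-cluster keyword summary.
with open("clustering_results.json") as f:
    loaded = json.load(f)
for cid, words in sorted(loaded["cluster_keywords"].items()):
    print(f"Cluster {cid}: {', '.join(words)}")
```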
Clustering Methods
1. K-means Clustering
- Best for: General-purpose clustering when you know the number of clusters
- Pros: Fast, simple, works well with TF-IDF vectors
- Cons: Assumes spherical clusters, sensitive to initialization
2. Hierarchical Clustering
- Best for: When you want to understand cluster hierarchy
- Pros: No assumptions about cluster shape, provides dendrogram
- Cons: Slower for large datasets, memory intensive
3. LDA Topic Modeling
- Best for: Discovering latent topics in text data
- Pros: Probabilistic, provides topic distributions
- Cons: Assumes documents are mixtures of topics
4. NMF Topic Modeling
- Best for: Non-negative topic modeling
- Pros: Non-negative factors, often more interpretable
- Cons: Sensitive to initialization, may not converge
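The four methods above can be sketched side by side on a tiny corpus. This is a minimal illustration assuming scikit-learn (which the README's feature set suggests), not the script's internal code; note that K-means and hierarchical clustering return hard labels directly, while LDA and NMF return topic distributions from which a hard label is taken via argmax:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [
    "restaurant serving fresh food and italian cuisine",
    "cozy restaurant with great food and friendly service",
    "cloud software platform for enterprise data systems",
    "machine learning software platform for business analytics",
]

# TF-IDF features feed all four methods discussed in this README.
X = TfidfVectorizer().fit_transform(docs)

# Hard-label clusterers (AgglomerativeClustering needs a dense array).
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())

# Topic models: rows are per-document topic weights; argmax gives a hard label.
lda_topics = LatentDirichletAllocation(n_components=2, random_state=42).fit_transform(X)
nmf_topics = NMF(n_components=2, init="nndsvd", random_state=42).fit_transform(X)
lda_labels = lda_topics.argmax(axis=1)
nmf_labels = nmf_topics.argmax(axis=1)
print(km_labels, hc_labels, lda_labels, nmf_labels)
```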
Evaluation Metrics
Silhouette Score
- Range: -1 to 1
- Interpretation: Higher is better
- Meaning: Measures how similar an object is to its own cluster vs other clusters
Calinski-Harabasz Score
- Range: 0 to ∞
- Interpretation: Higher is better
- Meaning: Ratio of between-cluster dispersion to within-cluster dispersion
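Both metrics are available in scikit-learn. A minimal sketch on synthetic data (two well-separated blobs, so both scores come out high):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Two tight, well-separated blobs: an easy clustering problem.
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)  # in [0, inf), higher is better
print(f"Silhouette: {sil:.3f}, Calinski-Harabasz: {ch:.1f}")
```

On real text data, silhouette scores tend to be much lower (the sample output above shows 0.23), since TF-IDF clusters are rarely this well separated.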
Tips for Best Results
- Preprocess your data: Ensure descriptions are clean and consistent
- Choose appropriate number of clusters: Use domain knowledge or elbow method
- Experiment with different methods: Try multiple algorithms and compare results
- Adjust max_features: More features capture more detail but may introduce noise
- Use visualization: t-SNE plots help understand cluster structure
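The elbow method mentioned above can be sketched with scikit-learn: fit K-means for a range of k and look for the point where inertia (within-cluster sum of squares) stops dropping sharply. This example uses synthetic data with three obvious groups, so the elbow sits at k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three well-separated blobs, so inertia should flatten out after k = 3.
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 4, 8)])

inertias = {}
for k in range(1, 7):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
    print(k, round(inertias[k], 1))
```

In practice, plot k against inertia and pick the k at the bend; on real text data the bend is usually less pronounced than here.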
Troubleshooting
Common Issues
- Memory errors with large datasets: Reduce max_features or use smaller datasets
- Poor clustering quality: Try different numbers of clusters or preprocessing
- NLTK download errors: Run python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"
Performance Optimization
- For large datasets (>10k descriptions), consider using batch processing
- Use the max_features parameter to control memory usage
- Consider using sparse matrices for very large feature sets
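A sketch of both tips, assuming scikit-learn: TfidfVectorizer already returns a sparse matrix, and MiniBatchKMeans (a batch-processing variant of K-means) accepts it directly, so the features are never densified:

```python
import scipy.sparse
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for a large corpus: 300 docs over 3 repeated topics.
docs = [f"document about topic with repeated words topic{i % 3}" for i in range(300)]

# max_features caps the vocabulary; the result stays a sparse matrix.
X = TfidfVectorizer(max_features=500).fit_transform(docs)
print(scipy.sparse.issparse(X))  # sparse end to end

# MiniBatchKMeans processes the data in small batches, keeping memory flat.
mbk = MiniBatchKMeans(n_clusters=3, batch_size=64, n_init=3, random_state=0)
labels = mbk.fit_predict(X)
print(len(set(labels)))
```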
License
This script is part of the AI Agents course materials and follows the same license as the main project.