# Description Clustering Script

This script provides comprehensive text clustering capabilities for analyzing descriptions from CSV files. It supports multiple clustering algorithms and provides detailed analysis and visualization.

## Features

- **Multiple Clustering Methods**: K-means, hierarchical clustering, LDA topic modeling, and NMF topic modeling
- **Text Preprocessing**: tokenization, stopword removal, lemmatization
- **TF-IDF Vectorization**: converts text to numerical features
- **Clustering Evaluation**: silhouette score and Calinski-Harabasz score
- **Keyword Extraction**: identifies the top keywords for each cluster
- **Visualization**: t-SNE plots for cluster visualization
- **Sample Data**: built-in sample data for testing and demonstration

## Installation

1. Install the required dependencies:

   ```bash
   pip install -r clustering_requirements.txt
   ```

2. The script will automatically download the required NLTK data on first run.

## Usage

### Command Line Interface

The main script can be run from the command line:

```bash
# Create sample data
python description_clustering.py --create-sample

# Cluster with default settings (K-means, 5 clusters)
python description_clustering.py --input sample_descriptions.csv --method kmeans --clusters 5

# Use hierarchical clustering with visualization
python description_clustering.py --input data.csv --method hierarchical --clusters 8 --visualize

# Topic modeling with LDA
python description_clustering.py --input data.csv --method lda --clusters 6 --max-features 2000

# Specify a custom column name
python description_clustering.py --input data.csv --column "product_description" --method kmeans --clusters 10
```

### Command Line Options

- `--input`: input CSV file path
- `--column`: column name containing the descriptions (default: `description`)
- `--method`: clustering method (`kmeans`, `hierarchical`, `lda`, or `nmf`)
- `--clusters`: number of clusters/topics
- `--max-features`: maximum number of features for TF-IDF vectorization
- `--output`: output JSON file for results
- `--visualize`: generate a t-SNE visualization
- `--create-sample`: create sample data for testing

### Python API

You can also use the clustering functionality programmatically:

```python
from description_clustering import DescriptionClusterer
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')

# Initialize the clusterer
clusterer = DescriptionClusterer()

# Preprocess and vectorize
descriptions = df['description'].tolist()
processed_descriptions = clusterer.preprocess_text(descriptions)
embeddings = clusterer.vectorize_text(processed_descriptions)

# Perform clustering
cluster_labels = clusterer.kmeans_clustering(embeddings, n_clusters=5)

# Evaluate the results
evaluation_scores = clusterer.evaluate_clustering(embeddings)
cluster_keywords = clusterer.get_cluster_keywords()

# Visualize
clusterer.visualize_clusters(embeddings)
```
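A common follow-up to the API example above is to attach the labels back to the DataFrame and inspect cluster sizes, which mirrors the script's `clustered_[input_file].csv` output. The sketch below is self-contained: the sample rows and integer labels stand in for the real `df['description']` column and the labels returned by `kmeans_clustering`.

```python
import pandas as pd

# Stand-in data: in practice these come from your CSV and the clusterer.
df = pd.DataFrame({"description": ["fresh pasta dinner", "cloud software platform",
                                   "family pizza night", "mobile app backend"]})
cluster_labels = [0, 1, 0, 1]  # hypothetical labels, one per row

# Attach labels and count descriptions per cluster
df["cluster"] = cluster_labels
print(df["cluster"].value_counts().sort_index())

# Save the labeled data, mirroring the script's CSV output
df.to_csv("clustered_output.csv", index=False)
```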

## Example Usage

### Running the Example Script

```bash
python example_clustering.py
```

This will:

  1. Create sample data with 100 descriptions across 10 categories
  2. Run clustering with 3 different methods (K-means, Hierarchical, LDA)
  3. Compare results and save outputs
  4. Provide detailed analysis of clusters

### Sample Output

```
=== Description Clustering Example ===

1. Creating sample data...
   Created 100 sample descriptions

2. Initializing clusterer...

3. Preprocessing text...
   Preprocessed 100 descriptions

4. Vectorizing text...
   Created embeddings with shape: (100, 500)

5. Performing KMEANS clustering...
   Silhouette Score: 0.2345
   Calinski-Harabasz Score: 45.67

6. Clustering Results Summary:
==================================================

KMEANS Clustering:
  Silhouette Score: 0.2345
  Calinski-Harabasz Score: 45.67
  Cluster Distribution:
    Cluster 0: 12 descriptions
    Cluster 1: 8 descriptions
    ...
  Top Keywords by Cluster:
    Cluster 0: restaurant, food, dining, service, cuisine
    Cluster 1: technology, software, system, platform, solution
    ...
```

## Output Files

The script generates several output files:

1. `clustering_results.json`: detailed clustering results, including:
   - evaluation scores
   - cluster keywords
   - cluster distribution
   - sample descriptions per cluster
2. `clustered_[input_file].csv`: the original data with cluster labels added
3. `cluster_visualization.png`: t-SNE visualization (if the `--visualize` flag is used)

## Clustering Methods

### 1. K-means Clustering

- **Best for**: general-purpose clustering when you know the number of clusters
- **Pros**: fast, simple, works well with TF-IDF vectors
- **Cons**: assumes spherical clusters, sensitive to initialization

### 2. Hierarchical Clustering

- **Best for**: when you want to understand the cluster hierarchy
- **Pros**: no assumptions about cluster shape, provides a dendrogram
- **Cons**: slower for large datasets, memory intensive

### 3. LDA Topic Modeling

- **Best for**: discovering latent topics in text data
- **Pros**: probabilistic, provides topic distributions per document
- **Cons**: assumes documents are mixtures of topics

### 4. NMF Topic Modeling

- **Best for**: non-negative topic modeling
- **Pros**: non-negative factors, often more interpretable
- **Cons**: sensitive to initialization, may not converge
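All four methods are available in scikit-learn, which a script like this one typically wraps; the class names below are scikit-learn's, not the script's. This minimal sketch runs each method on a toy TF-IDF matrix to show the difference in output: the clustering methods assign one hard label per document, while the topic models return a per-document distribution over topics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [
    "great restaurant with tasty food",
    "cozy dining and friendly service",
    "software platform for cloud systems",
    "scalable technology solution",
]
X = TfidfVectorizer().fit_transform(docs)

# Hard cluster assignments (one label per document)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())  # needs dense input

# Topic models: each document gets a weight per topic instead of one label
lda_topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
nmf_topics = NMF(n_components=2, init="nndsvd", random_state=0).fit_transform(X)

print(km_labels, hc_labels, lda_topics.shape, nmf_topics.shape)
```

Note that LDA is usually fit on raw term counts rather than TF-IDF weights; it accepts any non-negative matrix, but `CountVectorizer` output is the more conventional input.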

## Evaluation Metrics

### Silhouette Score

- **Range**: -1 to 1
- **Interpretation**: higher is better
- **Meaning**: measures how similar an object is to its own cluster compared with other clusters

### Calinski-Harabasz Score

- **Range**: 0 to ∞
- **Interpretation**: higher is better
- **Meaning**: the ratio of between-cluster dispersion to within-cluster dispersion
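Both metrics come directly from scikit-learn's `sklearn.metrics` module. A minimal sketch, using two well-separated synthetic blobs in place of real TF-IDF embeddings so the scores come out clearly high:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Two tight, well-separated blobs stand in for document embeddings
X = np.vstack([rng.normal(0, 0.1, (20, 5)), rng.normal(3, 0.1, (20, 5))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # -1 to 1, higher is better
ch = calinski_harabasz_score(X, labels)  # 0 to inf, higher is better
print(f"silhouette={sil:.3f}, calinski-harabasz={ch:.1f}")
```

On sparse, high-dimensional TF-IDF vectors the absolute values are typically much lower than on these toy blobs (as in the sample output above), so the scores are most useful for comparing runs, not as absolute quality thresholds.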

## Tips for Best Results

1. **Preprocess your data**: ensure descriptions are clean and consistent
2. **Choose an appropriate number of clusters**: use domain knowledge or the elbow method
3. **Experiment with different methods**: try multiple algorithms and compare the results
4. **Adjust `max_features`**: more features capture more detail but may introduce noise
5. **Use visualization**: t-SNE plots help you understand the cluster structure
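The elbow method mentioned in tip 2 can be sketched by computing K-means inertia (within-cluster sum of squares) over a range of cluster counts and looking for the point where the curve flattens. The synthetic data below has three clusters, so the drop in inertia should level off after k=3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic clusters; the elbow should appear around k=3
X = np.vstack([rng.normal(c, 0.2, (30, 4)) for c in (0, 2, 4)])

inertias = {}
for k in range(1, 7):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertias[k], 2))
# Inertia always decreases as k grows; the "elbow" is where the drop flattens out.
```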

## Troubleshooting

### Common Issues

1. **Memory errors with large datasets**: reduce `max_features` or work with a smaller sample
2. **Poor clustering quality**: try a different number of clusters or adjust the preprocessing
3. **NLTK download errors**: run `python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"`

## Performance Optimization

- For large datasets (>10k descriptions), consider batch processing
- Use the `max_features` parameter to control memory usage
- Consider keeping very large feature sets as sparse matrices
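On the last two points: scikit-learn's `TfidfVectorizer` (which a script like this one presumably uses) already returns a SciPy sparse matrix, so the main memory lever is capping the vocabulary with `max_features`. A quick sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["fast simple clustering of text data",
        "text clustering for large datasets",
        "sparse matrices keep memory usage low"]

full = TfidfVectorizer().fit_transform(docs)
capped = TfidfVectorizer(max_features=5).fit_transform(docs)

# Both are scipy.sparse matrices; only nonzero entries are stored
print(type(full).__name__)
print(full.shape[1], capped.shape[1])  # vocabulary capped at 5 features
```

Avoid calling `.toarray()` on the full matrix for large corpora; most scikit-learn estimators used here (e.g. `KMeans`, `NMF`) accept sparse input directly.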

## License

This script is part of the AI Agents course materials and follows the same license as the main project.