# Description Clustering Script

This script provides comprehensive text clustering capabilities for analyzing descriptions from CSV files. It supports multiple clustering algorithms and provides detailed analysis and visualization.

## Features

- **Multiple Clustering Methods**: K-means, hierarchical clustering, LDA topic modeling, and NMF topic modeling
- **Text Preprocessing**: tokenization, stopword removal, lemmatization
- **TF-IDF Vectorization**: converts text to numerical features
- **Clustering Evaluation**: silhouette score and Calinski-Harabasz score
- **Keyword Extraction**: identifies the top keywords for each cluster
- **Visualization**: t-SNE plots for cluster visualization
- **Sample Data**: built-in sample data for testing and demonstration

## Installation

1. Install the required dependencies:

   ```bash
   pip install -r clustering_requirements.txt
   ```

2. The script will automatically download the required NLTK data on first run.

## Usage

### Command Line Interface

The main script can be run from the command line:

```bash
# Create sample data
python description_clustering.py --create-sample

# Cluster with default settings (K-means, 5 clusters)
python description_clustering.py --input sample_descriptions.csv --method kmeans --clusters 5

# Use hierarchical clustering with visualization
python description_clustering.py --input data.csv --method hierarchical --clusters 8 --visualize

# Topic modeling with LDA
python description_clustering.py --input data.csv --method lda --clusters 6 --max-features 2000

# Specify a custom column name
python description_clustering.py --input data.csv --column "product_description" --method kmeans --clusters 10
```

### Command Line Options

- `--input`: input CSV file path
- `--column`: column name containing the descriptions (default: `description`)
- `--method`: clustering method (`kmeans`, `hierarchical`, `lda`, or `nmf`)
- `--clusters`: number of clusters/topics
- `--max-features`: maximum number of features for TF-IDF vectorization
- `--output`: output JSON file for results
- `--visualize`: generate a t-SNE visualization
- `--create-sample`: create sample data for testing

### Python API

You can also use the clustering functionality programmatically:

```python
from description_clustering import DescriptionClusterer
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')

# Initialize the clusterer
clusterer = DescriptionClusterer()

# Preprocess and vectorize
descriptions = df['description'].tolist()
processed_descriptions = clusterer.preprocess_text(descriptions)
embeddings = clusterer.vectorize_text(processed_descriptions)

# Perform clustering
cluster_labels = clusterer.kmeans_clustering(embeddings, n_clusters=5)

# Evaluate the results
evaluation_scores = clusterer.evaluate_clustering(embeddings)
cluster_keywords = clusterer.get_cluster_keywords()

# Visualize
clusterer.visualize_clusters(embeddings)
```
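A common follow-up to the API example above is to attach the labels back to the DataFrame and inspect cluster sizes, which mirrors the script's `clustered_[input_file].csv` output. The sketch below is self-contained: the sample rows and integer labels stand in for the real `df['description']` column and the labels returned by `kmeans_clustering`.

```python
import pandas as pd

# Stand-in data: in practice these come from your CSV and the clusterer.
df = pd.DataFrame({"description": ["fresh pasta dinner", "cloud software platform",
                                   "family pizza night", "mobile app backend"]})
cluster_labels = [0, 1, 0, 1]  # hypothetical labels, one per row

# Attach labels and count descriptions per cluster
df["cluster"] = cluster_labels
print(df["cluster"].value_counts().sort_index())

# Save the labeled data, mirroring the script's CSV output
df.to_csv("clustered_output.csv", index=False)
```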

## Example Usage

### Running the Example Script

```bash
python example_clustering.py
```

This will:

  1. Create sample data with 100 descriptions across 10 categories
  2. Run clustering with 3 different methods (K-means, Hierarchical, LDA)
  3. Compare results and save outputs
  4. Provide detailed analysis of clusters

### Sample Output

```
=== Description Clustering Example ===

1. Creating sample data...
   Created 100 sample descriptions

2. Initializing clusterer...

3. Preprocessing text...
   Preprocessed 100 descriptions

4. Vectorizing text...
   Created embeddings with shape: (100, 500)

5. Performing KMEANS clustering...
   Silhouette Score: 0.2345
   Calinski-Harabasz Score: 45.67

6. Clustering Results Summary:
==================================================

KMEANS Clustering:
  Silhouette Score: 0.2345
  Calinski-Harabasz Score: 45.67
  Cluster Distribution:
    Cluster 0: 12 descriptions
    Cluster 1: 8 descriptions
    ...
  Top Keywords by Cluster:
    Cluster 0: restaurant, food, dining, service, cuisine
    Cluster 1: technology, software, system, platform, solution
    ...
```

## Output Files

The script generates several output files:

1. `clustering_results.json`: detailed clustering results, including:
   - evaluation scores
   - cluster keywords
   - cluster distribution
   - sample descriptions per cluster
2. `clustered_[input_file].csv`: the original data with cluster labels added
3. `cluster_visualization.png`: t-SNE visualization (if the `--visualize` flag is used)

## Clustering Methods

### 1. K-means Clustering

- **Best for**: general-purpose clustering when you know the number of clusters
- **Pros**: fast, simple, works well with TF-IDF vectors
- **Cons**: assumes spherical clusters, sensitive to initialization

### 2. Hierarchical Clustering

- **Best for**: when you want to understand the cluster hierarchy
- **Pros**: no assumptions about cluster shape, provides a dendrogram
- **Cons**: slower for large datasets, memory intensive

### 3. LDA Topic Modeling

- **Best for**: discovering latent topics in text data
- **Pros**: probabilistic, provides topic distributions per document
- **Cons**: assumes documents are mixtures of topics

### 4. NMF Topic Modeling

- **Best for**: non-negative topic modeling
- **Pros**: non-negative factors, often more interpretable
- **Cons**: sensitive to initialization, may not converge
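All four methods are available in scikit-learn, which a script like this one typically wraps; the class names below are scikit-learn's, not the script's. This minimal sketch runs each method on a toy TF-IDF matrix to show the difference in output: the clustering methods assign one hard label per document, while the topic models return a per-document distribution over topics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [
    "great restaurant with tasty food",
    "cozy dining and friendly service",
    "software platform for cloud systems",
    "scalable technology solution",
]
X = TfidfVectorizer().fit_transform(docs)

# Hard cluster assignments (one label per document)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())  # needs dense input

# Topic models: each document gets a weight per topic instead of one label
lda_topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
nmf_topics = NMF(n_components=2, init="nndsvd", random_state=0).fit_transform(X)

print(km_labels, hc_labels, lda_topics.shape, nmf_topics.shape)
```

Note that LDA is usually fit on raw term counts rather than TF-IDF weights; it accepts any non-negative matrix, but `CountVectorizer` output is the more conventional input.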

## Evaluation Metrics

### Silhouette Score

- **Range**: -1 to 1
- **Interpretation**: higher is better
- **Meaning**: measures how similar an object is to its own cluster compared with other clusters

### Calinski-Harabasz Score

- **Range**: 0 to ∞
- **Interpretation**: higher is better
- **Meaning**: the ratio of between-cluster dispersion to within-cluster dispersion
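Both metrics come directly from scikit-learn's `sklearn.metrics` module. A minimal sketch, using two well-separated synthetic blobs in place of real TF-IDF embeddings so the scores come out clearly high:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Two tight, well-separated blobs stand in for document embeddings
X = np.vstack([rng.normal(0, 0.1, (20, 5)), rng.normal(3, 0.1, (20, 5))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # -1 to 1, higher is better
ch = calinski_harabasz_score(X, labels)  # 0 to inf, higher is better
print(f"silhouette={sil:.3f}, calinski-harabasz={ch:.1f}")
```

On sparse, high-dimensional TF-IDF vectors the absolute values are typically much lower than on these toy blobs (as in the sample output above), so the scores are most useful for comparing runs, not as absolute quality thresholds.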

## Tips for Best Results

1. **Preprocess your data**: ensure descriptions are clean and consistent
2. **Choose an appropriate number of clusters**: use domain knowledge or the elbow method
3. **Experiment with different methods**: try multiple algorithms and compare the results
4. **Adjust `max_features`**: more features capture more detail but may introduce noise
5. **Use visualization**: t-SNE plots help you understand the cluster structure
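The elbow method mentioned in tip 2 can be sketched by computing K-means inertia (within-cluster sum of squares) over a range of cluster counts and looking for the point where the curve flattens. The synthetic data below has three clusters, so the drop in inertia should level off after k=3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic clusters; the elbow should appear around k=3
X = np.vstack([rng.normal(c, 0.2, (30, 4)) for c in (0, 2, 4)])

inertias = {}
for k in range(1, 7):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertias[k], 2))
# Inertia always decreases as k grows; the "elbow" is where the drop flattens out.
```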

## Troubleshooting

### Common Issues

1. **Memory errors with large datasets**: reduce `max_features` or work with a smaller sample
2. **Poor clustering quality**: try a different number of clusters or adjust the preprocessing
3. **NLTK download errors**: run `python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"`

## Performance Optimization

- For large datasets (>10k descriptions), consider batch processing
- Use the `max_features` parameter to control memory usage
- Consider keeping very large feature sets as sparse matrices
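On the last two points: scikit-learn's `TfidfVectorizer` (which a script like this one presumably uses) already returns a SciPy sparse matrix, so the main memory lever is capping the vocabulary with `max_features`. A quick sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["fast simple clustering of text data",
        "text clustering for large datasets",
        "sparse matrices keep memory usage low"]

full = TfidfVectorizer().fit_transform(docs)
capped = TfidfVectorizer(max_features=5).fit_transform(docs)

# Both are scipy.sparse matrices; only nonzero entries are stored
print(type(full).__name__)
print(full.shape[1], capped.shape[1])  # vocabulary capped at 5 features
```

Avoid calling `.toarray()` on the full matrix for large corpora; most scikit-learn estimators used here (e.g. `KMeans`, `NMF`) accept sparse input directly.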

## License

This script is part of the AI Agents course materials and follows the same license as the main project.