Title: Text Clustering as Classification with LLMs
URL Source: https://arxiv.org/html/2410.00927
Published Time: Wed, 08 Oct 2025 00:39:03 GMT
Abstract.
Text clustering serves as a fundamental technique for organizing and interpreting unstructured textual data, particularly in contexts where manual annotation is prohibitively costly. With the rapid advancement of Large Language Models (LLMs) and their demonstrated effectiveness across a broad spectrum of NLP tasks, an emerging body of research has begun to explore their potential in the domain of text clustering. However, existing LLM-based approaches still rely on fine-tuned embedding models and sophisticated similarity metrics, rendering them computationally intensive and necessitating domain-specific adaptation. To address these limitations, we propose a novel framework that reframes text clustering as a classification task by harnessing the in-context learning capabilities of LLMs. Our framework eliminates the need for fine-tuning embedding models or intricate clustering algorithms. It comprises two key steps: first, the LLM is prompted to generate a set of candidate labels based on the dataset and then merges semantically similar labels; second, it assigns the most appropriate label to each text sample. By leveraging the advanced natural language understanding and generalization capabilities of LLMs, the proposed approach enables effective clustering with minimal human intervention. Experimental results on diverse datasets demonstrate that our framework achieves comparable or superior performance to state-of-the-art embedding-based clustering techniques, while significantly reducing computational complexity and resource requirements. These findings underscore the transformative potential of LLMs in simplifying and enhancing text clustering tasks. We make our code publicly available at https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM; the supplementary Appendix is also provided in the repository.
Large Language Model, Text Clustering, Text Classification
Published in Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP 2025), December 7–10, 2025, Xi'an, China. DOI: 10.1145/3767695.3769519. ISBN: 979-8-4007-2218-9/2025/12.

1. Introduction
Text clustering is a fundamental task in natural language processing (NLP), which aims to group similar texts based on their content without prior labeling. It is widely applied in scenarios where manual annotation is costly or impractical, such as improving community detection in social media (Qi et al., 2012; Wang et al., 2024), identifying emerging topics (Castellanos et al., 2017; Pang et al., 2025), analyzing large-scale textual datasets (Aggarwal and Zhai, 2012; Mehta et al., 2021), structuring unorganized information (Cutting et al., 2017), and enhancing document retrieval (Cutting et al., 1993; Anick and Vaithyanathan, 1997; Agarwal et al., 2020). Despite its importance, traditional text clustering faces notable challenges in both methodology and implementation.
Traditional text clustering methods typically involve transforming textual data into numerical representations and applying clustering algorithms to group similar texts based on these representations. A common approach is to use pre-trained embedding models (Devlin et al., 2018; Muennighoff et al., 2023; Wang et al., 2022; Su et al., 2022) to convert text into dense vector embeddings that capture semantic relationships. These embeddings are then clustered using algorithms like K-means (Lloyd, 1982), DBSCAN (Ester et al., 1996), or hierarchical clustering (Johnson, 1967). However, fine-tuning embeddings for domain-specific tasks is computationally expensive and requires labeled data (Song et al., 2025). Besides, clustering algorithms are sensitive to hyperparameters, such as the number of clusters and distance metrics, which often need manual tuning based on expert knowledge. The resulting clusters also lack interpretability, as clustering models typically do not produce meaningful labels for the groups. These challenges make traditional clustering approaches less flexible and efficient, particularly when applied to diverse, large-scale text datasets.
Recent advancements in LLMs, such as the GPT series (Brown et al., 2020; Ouyang et al., 2022; OpenAI, 2023), have showcased remarkable reasoning performance across a wide range of NLP tasks (Zhang et al., 2025; He et al., 2025; Sheng et al., 2025). These models can comprehend and generate human-like text with great in-context learning abilities (He and Huang, 2025; Yao et al., 2025), making them potential candidates for clustering tasks. However, existing LLM-based clustering methods (Zhang et al., 2023; Wang et al., 2023; Viswanathan et al., 2024) still rely on external embedding models like BERT or E5 and traditional clustering techniques such as K-means, thereby inheriting the same hyperparameter tuning and fine-tuning constraints. Moreover, API-based LLMs do not provide direct access to their internal embeddings, limiting their adaptability in clustering applications.
To address these challenges, we propose a novel two-stage LLM-driven clustering framework that reframes text clustering as a classification task, leveraging the generative and reasoning capabilities of LLMs. Our approach consists of two key stages: label generation stage and text classification stage. The LLM processes input texts sequentially in mini-batches and generates meaningful labels based on content similarities. This dynamic label generation enables adaptive clustering without requiring predefined cluster numbers or embedding fine-tuning. Once labels are established, the LLM classifies the remaining texts according to the generated labels, effectively grouping similar texts without the need for conventional clustering algorithms. By transforming clustering into a classification task, our approach effectively addresses the core limitations of traditional clustering methods. First, it eliminates the need for fine-tuned embeddings, making the method highly adaptable across different datasets without requiring extensive model customization. Second, it avoids the necessity of manual hyperparameter tuning, thereby reducing reliance on expert knowledge and mitigating the risks associated with suboptimal parameter selection. Third, our framework enhances interpretability by leveraging the LLM’s capability to generate human-readable labels, providing meaningful insights into the resulting clusters. Lastly, by processing data in sequential mini-batches, it overcomes the input length limitations of LLMs, enabling efficient clustering of large-scale text datasets without compromising performance.
We evaluate our framework on five datasets across diverse NLP tasks, including topic mining, emotion detection, intent discovery, and domain classification, with cluster granularities ranging from 18 to 102. Our results demonstrate that the proposed approach achieves comparable or superior clustering performance compared to state-of-the-art methods while significantly reducing computational overhead and manual effort.
Our key contributions are summarized as follows:
- We propose a novel LLM-driven framework that reframes clustering as a classification task, leveraging the label generation and reasoning capabilities of LLMs. Compared to recent LLM-based clustering methods, this automated clustering pipeline eliminates the need for fine-tuned embeddings and hyperparameter tuning.
- By utilizing LLMs' powerful summarization and classification capabilities, our framework generates high-quality, human-understandable labels, providing a practical and scalable alternative to traditional clustering approaches.
- Extensive experiments on five diverse datasets demonstrate that our method achieves state-of-the-art results compared to recent LLM-based clustering methods, while being more computationally efficient and adaptable to various domains.
2. Related Work
2.1. Clustering
Clustering, as a fundamental task in machine learning, has been applied to various data types, including texts (Beil et al., 2002; Aggarwal and Zhai, 2012; Xu et al., 2015; Ma and Zhuge, 2024), images (Yang et al., 2010; Chang et al., 2017; Wu et al., 2019; Ren et al., 2020; Park et al., 2021; Li et al., 2024), and graphs (Schaeffer, 2007; Zhou et al., 2009; Tian et al., 2014; Yin et al., 2017; Huang et al., 2024; Qi et al., 2024; Kalogeropoulos et al., 2025). Traditional clustering methods, such as K-means (Lloyd, 1982) and DBSCAN (Ester et al., 1996), have been widely applied in text clustering due to their simplicity and efficiency. K-means is an iterative partitioning algorithm that assigns data points to clusters based on their distance from centroids, requiring the number of clusters to be predefined. This limitation makes it less adaptable when the true number of clusters is unknown. DBSCAN, on the other hand, is a density-based clustering method that identifies clusters of arbitrary shape and does not require a predefined cluster number. However, it struggles with high-dimensional data, such as text embeddings, and requires careful tuning of distance thresholds. Both methods rely heavily on well-crafted feature representations, and their performance is sensitive to the choice of similarity measures and hyperparameters. Additionally, they do not provide meaningful cluster labels, making it difficult to analyze the structure of clustered text groups.
In addition to research aimed at improving traditional machine learning algorithms for clustering (Liu et al., 2025; Ding et al., 2024; Fei et al., 2025), recent studies have increasingly focused on leveraging deep neural networks, which model instance similarities by learning meaningful representations (Huang et al., 2014; Guo et al., 2017; Bo et al., 2020; Zhou et al., 2022; Ren et al., 2023; Lee et al., 2024; Ros and Riad, 2024). For example, Yang et al. (2016) propose a recurrent network for joint unsupervised learning of deep representations in clustering. Caron et al. (2018) jointly learn the parameters of neural networks and the cluster assignments of the resulting features. Tao et al. (2021) combine instance discrimination and feature decorrelation to present a clustering-friendly representation learning method. While these methods have demonstrated strong performance, they require an additional training process to obtain feature representations, followed by the application of traditional clustering algorithms (Guan et al., 2020).
The reliance of these approaches on extensive training constrains their adaptability across datasets, as models must be retrained for each new domain, leading to significant computational costs. In contrast, our framework, which transforms text clustering into a classification task, eliminates the need for fine-tuning or embedding-specific training, enabling seamless adaptation to diverse datasets without incurring additional computational overhead.
2.2. Adding Explanations to Text Clusters
While previous clustering algorithms do not necessarily produce interpretable clusters (Chang et al., 2009), several studies have focused on explaining clusters with semantically meaningful expressions (Zhang et al., 2015; Yang et al., 2025). Treeratpituk and Callan (2006) assign labels to hierarchical clusters and assess potential labels by utilizing information from the cluster itself, its parent cluster, and corpus statistics; Carmel et al. (2009) propose a framework that selects candidate labels from external resources like Wikipedia to represent the content of the cluster; Navigli and Crisafulli (2010) induce word senses when clustering results based on their semantic similarity; Zhang et al. (2018) iteratively identify general terms and refine the sub-topics during clustering to split coarse topics into fine-grained ones. However, label- or phrase-level information is limited in describing a complex cluster (Wang et al., 2023), and the assigned labels may have similar meanings, resulting in overlapping labels. Thus, more in-depth expressions and better granularity control are required to make clusters more explainable and accurate.
By utilizing LLMs to generate interpretable cluster labels, our method enhances the explainability of clustering results, providing meaningful insights into the grouped data. This formulation not only improves clustering quality but also significantly reduces the complexity of the clustering process.
2.3. Text Clustering using LLMs
The recent rapid development of Large Language Models (LLMs), such as the GPT series (Brown et al., 2020; Ouyang et al., 2022; OpenAI, 2023), has demonstrated their powerful comprehensive language capabilities, and several works have applied LLMs to the text clustering task. Wang et al. (2023) utilize LLMs to propose explanations for the clusters and classify the samples based on the generated explanations; De Raedt et al. (2023) collect descriptive utterance labels from LLMs with well-chosen prototypical utterances to bootstrap in-context learning; Kwon et al. (2023) use LLMs to label the description of input data and cluster the labels with a given K. Beyond explanation and label generation, Viswanathan et al. (2024) expand documents' keyphrases, generate pairwise constraints, and correct low-confidence points in the clusters via LLMs, while Zhang et al. (2023) leverage feedback from LLMs to improve smaller embedders, such as Instructor (Su et al., 2022) and E5 (Wang et al., 2022), and prompt LLMs for help with clustering granularity.
All these methods use LLMs in an indirect way: the LLM only processes part of the input data and never sees the whole dataset. In contrast, our proposed framework dynamically generates cluster labels and assigns data points in a sequential manner, mitigating the challenges of high-dimensional text clustering and fully leveraging the generative and reasoning capabilities of LLMs.
3. Methodology
In this work, we propose a two-stage framework that utilizes a single LLM for text clustering tasks. To better leverage the generative and classification capabilities of LLMs, we transform the clustering task into a label-based classification task, allowing the LLM to process the data more effectively. As illustrated in Figure 1 and summarized in Algorithm 1, unlike previous text clustering methods such as ClusterLLM (Zhang et al., 2023) that calculate distances between data points in vector space, our framework does not require fine-tuning for better representation or a pre-assigned cluster number $K$. We first prompt the LLM to generate potential labels for the data. After merging similar labels, we then prompt the LLM to classify the input data based on these generated labels. The detailed steps of our framework are introduced in the following sections.
3.1. Task Definition
Figure 1. A comparison between other methods using LLMs (left) and our framework (right) for text clustering. Our framework transforms the clustering task into a text classification task by generating potential labels (Stage 1) and classifying input sentences according to the labels (Stage 2) using LLMs.
For text clustering, given an unlabeled dataset $\mathcal{D}=\{d_i\}_{i=1}^{N}$, where $N$ is the size of the corpus, the goal is to output $K$ subsets of $\mathcal{D}$ as $\mathcal{C}=\{c_j\}_{j=1}^{K}$, where $K$ represents the number of clusters and each $c_j$ represents a cluster, such that $d_1 \in c_j$ and $d_2 \in c_j$ if $d_1$ and $d_2$ belong to the same cluster. In this work, we transform the text clustering task into a classification task. Specifically, given the dataset $\mathcal{D}$, the model first generates a set of labels $\mathcal{L}=\{l_k\}_{k=1}^{K'}$ based on the content of the dataset, where $K'$ is the number of labels. Subsequently, each data point $d_i \in \mathcal{D}$ is classified into one of the labels $l \in \mathcal{L}$, and the input dataset is clustered into $K'$ clusters $\mathcal{C}'=\{c'_j\}_{j=1}^{K'}$.
3.2. Label Generation Using LLMs
In this section, we describe how we formulate a label-generation task to obtain potential labels for clusters using LLMs. Given the few-shot capabilities of LLMs (Brown et al., 2020), we provide several example label names to fully utilize the in-context learning ability of LLMs.
3.2.1. Potential Label Generation
Since inputting an entire dataset into LLMs is impractical due to context length limitations, we input the dataset in mini-batches and then aggregate the potential labels. Subsequently, we prompt the model to merge similar labels to adjust the granularity of the clusters. Specifically, given a batch size $B$, we first prompt the LLM with $B$ instances along with $n$ example label names to generate potential labels for the input data using a prompt $\mathcal{P}_g$, where the dataset is divided into $\lceil N/B \rceil$ mini-batches for processing:
(1) $\mathcal{L}' = \mathcal{P}_g(\mathcal{I}_{\text{generate}}, \mathcal{D}', l)$

where $\mathcal{I}_{\text{generate}}$ is the label generation task instruction, $\mathcal{D}' = \{d_i\}_{i=1}^{B}$ is the input data in a mini-batch of size $B$, and $l$ represents the $n$ given label names.
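To make Eq. (1) concrete, the prompt $\mathcal{P}_g$ can be assembled from its three ingredients: the task instruction, one mini-batch of $B$ texts, and the $n$ example label names. The exact wording and layout below are a hypothetical illustration, not the paper's actual template (which is given in its Appendix):

```python
from typing import List

def build_generation_prompt(instruction: str,
                            batch: List[str],
                            example_labels: List[str]) -> str:
    """Assemble a label-generation prompt from the instruction I_generate,
    one mini-batch D' of B texts, and the n example label names l.
    The formatting here is illustrative only."""
    lines = [instruction, ""]
    lines.append("Example labels: " + ", ".join(example_labels))
    lines.append("")
    lines.append("Texts:")
    # Number each text so the LLM can refer back to individual instances.
    lines += [f"{i + 1}. {text}" for i, text in enumerate(batch)]
    return "\n".join(lines)
```
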
3.2.2. Potential Labels Aggregation and Mergence
After obtaining all the potential labels from LLMs, we aggregate the labels generated from each mini-batch together:
(2) $\mathcal{L}_{\text{unique}} = \{\, l \mid l \in \mathcal{L}' \,\}$
To avoid redundant duplication of final clusters caused by the LLM producing different descriptions for the same label, we further prompt the LLM to merge labels with similar expressions:
(3) $\mathcal{L} = \mathcal{P}_m(\mathcal{I}_{\text{merge}}, \mathcal{L}_{\text{unique}})$

where $\mathcal{I}_{\text{merge}}$ is the instruction of the merging task.
3.3. Given Label Classification
Given the potential labels for the entire dataset, we can now obtain the final clusters by performing label classification using LLMs. For each input instance, we prompt the LLM to assign a label from the previously generated potential labels:
(4) $c'_j = \mathcal{P}_a(\mathcal{I}_{\text{assign}}, d_j, \mathcal{L})$

where $c'_j$ is the cluster that the LLM classifies $d_j$ into, and $\mathcal{I}_{\text{assign}}$ is the instruction of the assigning task. After assigning all the data in the dataset according to the labels, we finally obtain the text clustering result $\mathcal{C}' = \{c'_j\}_{j=1}^{K'}$. For the detailed prompt template and instructions $\mathcal{I}_{\text{generate}}$, $\mathcal{I}_{\text{merge}}$, and $\mathcal{I}_{\text{assign}}$, please refer to the Appendix.
Algorithm 1 LLM-based Text Clustering as Classification

Input: unlabeled dataset $D=\{d_1, d_2, \ldots, d_N\}$, batch size $B$, few-shot labels $L_{\text{few}}$
Output: clusters $C'=\{c'_1, c'_2, \ldots, c'_{K'}\}$

1: Initialize $L_{\text{all}} \leftarrow \emptyset$
2: Split $D$ into $\lceil N/B \rceil$ mini-batches $\{D_1, D_2, \ldots, D_M\}$
3: for each batch $D_b \in \{D_1, D_2, \ldots, D_M\}$ do
4:   Prompt the LLM with $(D_b, L_{\text{few}})$ using $P_g$ to generate potential labels $L_b$
5:   $L_{\text{all}} \leftarrow L_{\text{all}} \cup L_b$
6: end for
7: $L_{\text{unique}} \leftarrow \text{unique}(L_{\text{all}})$
8: $L_{\text{final}} \leftarrow$ prompt the LLM with $L_{\text{unique}}$ using $P_m$ to merge similar labels
9: Initialize $C' \leftarrow \emptyset$
10: for each text $d \in D$ do
11:   $c_d \leftarrow$ prompt the LLM with $(d, L_{\text{final}})$ using $P_a$ to assign a label
12:   Add $d$ to cluster $c_d$ in $C'$
13: end for
14: return $C'$
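The algorithm above can be sketched as a minimal Python loop. Here `llm` stands in for any prompt-to-completion callable (e.g. a wrapper around a GPT-3.5-turbo API call), and the inline prompt strings are simplified placeholders rather than the paper's actual instructions:

```python
from typing import Callable, Dict, List

def cluster_as_classification(
    texts: List[str],
    llm: Callable[[str], str],
    batch_size: int = 15,
) -> Dict[str, List[str]]:
    """Two-stage clustering-as-classification sketch.

    Stage 1: prompt the LLM per mini-batch for candidate labels, then
    ask it to merge near-duplicates.  Stage 2: assign every text to one
    of the merged labels.  Prompts here are illustrative placeholders.
    """
    # Stage 1a: gather candidate labels from sequential mini-batches.
    candidates: List[str] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        reply = llm("Generate labels for:\n" + "\n".join(batch))
        candidates.extend(l.strip() for l in reply.splitlines() if l.strip())
    # Stage 1b: deduplicate, then let the LLM merge similar labels.
    unique = sorted(set(candidates))
    merged = llm("Merge similar labels:\n" + "\n".join(unique))
    labels = [l.strip() for l in merged.splitlines() if l.strip()]
    # Stage 2: classify each text into one of the merged labels.
    clusters: Dict[str, List[str]] = {label: [] for label in labels}
    for text in texts:
        choice = llm(f"Assign one label from {labels} to: {text}").strip()
        clusters.setdefault(choice, []).append(text)
    return clusters
```
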
4. Experiment
4.1. Dataset Description
We extensively evaluate our framework on five datasets encompassing diverse tasks, including topic mining, emotion detection, intent discovery, and domain discovery. Each dataset has different granularities, ranging from 18 to 102 clusters.
ArxivS2S (Muennighoff et al., 2023) is a text clustering dataset in the academic domain; it contains sentences describing a certain domain. GoEmo (Demszky et al., 2020) is a fine-grained dataset for emotion detection; multi-label and neutral instances are removed for text clustering purposes. Massive-I/D (FitzGerald et al., 2023) and MTOP-I (Li et al., 2021) are datasets originally used for classification but adapted for text clustering; "I" denotes intent and "D" denotes domain. Following Zhang et al. (2023), all datasets are split into large- and small-scale versions with the same number of clusters. Dataset statistics are summarized in Table 1. We use the small-scale versions of the datasets to reduce cost.
Table 1. Dataset statistics. #clusters denotes the number of true label clusters, while #data represents the number of instances within each cluster.
4.2. Implementation Details
We use GPT-3.5-turbo as the query LLM for label generation and given-label classification. Responses are controlled by adding a postfix: "Please response in JSON format". Detailed prompts and instructions are provided in Appendix A. We then extract the labels from the JSON response. During label generation, example label names are provided to the LLM; we set the number of given label names to 20% of the total number of labels in the dataset. To account for context length limitations, we set the mini-batch size $B$ to 15, meaning the LLM receives 15 input sentences at a time to generate potential labels.
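Because the model is asked to respond in JSON format, the labels must be parsed out of the raw reply. A defensive sketch, assuming (hypothetically) a top-level `labels` field; the actual response schema used by the paper may differ:

```python
import json
from typing import List

def extract_labels(llm_response: str) -> List[str]:
    """Parse label names out of an LLM reply that was asked to answer in
    JSON format. Assumes a top-level {"labels": [...]} object (a hypothetical
    schema); falls back to an empty list on malformed output."""
    try:
        payload = json.loads(llm_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    return [str(label).strip() for label in payload.get("labels", [])]
```
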
4.3. Evaluation Metrics
Following (De Raedt et al., 2023) and (Zhang et al., 2023), we evaluate clustering quality using three metrics: Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). Accuracy measures how well the predicted clusters align with the true labels, which requires addressing the inherent lack of ordering in clustering labels. To solve this, the Hungarian algorithm (Kuhn, 1955) is used to find an optimal mapping between predicted and true labels. Once aligned, ACC is calculated as the proportion of correctly assigned labels. NMI, on the other hand, quantifies the similarity between the true and predicted clusters by using mutual information to measure how much information about the true labels can be gained from the predicted clusters. This is then normalized by the average entropy of the two label sets, making NMI robust to differences in cluster sizes and independent of whether true or predicted clusters are treated as the ground truth. Lastly, ARI evaluates clustering quality by comparing pairs of samples, extending the Rand Index by accounting for the agreement expected by random chance. This adjustment ensures that ARI values close to zero indicate clustering performance no better than random, with negative values suggesting worse-than-random clustering and positive values reflecting better clustering.
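NMI and ARI are available directly in scikit-learn (`normalized_mutual_info_score`, `adjusted_rand_score`), while the Hungarian-aligned accuracy described above can be computed with SciPy's `linear_sum_assignment`, as sketched below (assuming integer cluster ids starting at 0):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred) -> float:
    """Hungarian-aligned clustering accuracy: build the contingency matrix
    between predicted and true cluster ids, find the label permutation that
    maximizes agreement (Kuhn, 1955), and score the resulting overlap."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    contingency = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[p, t] += 1
    # linear_sum_assignment minimizes cost, so negate to maximize matches.
    rows, cols = linear_sum_assignment(-contingency)
    return contingency[rows, cols].sum() / len(y_true)
```
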
Table 2. Experiment results of text clustering on five datasets, evaluated using Accuracy, NMI, and ARI. Best results are highlighted in bold. LLM_known_labels represents the theoretical upper bound for LLMs in this task. * indicates significant improvement under statistical significance tests with $p<0.05$.
4.4. Compared Baselines
To demonstrate that our framework of using LLMs directly, without embeddings or fine-tuning, can improve text clustering results, we compare our results not only with traditional clustering algorithms but also with methods that utilize LLMs to different extents. Since the baseline models were evaluated on different datasets, to better compare their performance with our framework, we implement the baseline models on the five datasets using the source code provided by the authors.
K-means/DBSCAN. We use embeddings extracted from E5-large (Wang et al., 2022) and Instructor-large (Su et al., 2022) and apply the K-means/DBSCAN algorithms to obtain the text clustering results. We run the clustering five times with different seeds and report the average as the final result.
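The embedding-plus-K-means baseline, averaged over seeds, can be sketched as follows with scikit-learn; `embeddings` is assumed to be a pre-computed matrix of sentence vectors (e.g. from E5-large or Instructor-large), which this sketch does not produce itself:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def kmeans_baseline_nmi(embeddings, y_true, n_clusters, seeds=range(5)) -> float:
    """Cluster pre-computed text embeddings with K-means and average the
    NMI score over several random seeds, mirroring the averaged-seed
    evaluation protocol described above."""
    scores = []
    for seed in seeds:
        pred = KMeans(n_clusters=n_clusters, random_state=seed,
                      n_init=10).fit_predict(np.asarray(embeddings))
        scores.append(normalized_mutual_info_score(y_true, pred))
    return float(np.mean(scores))
```
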
IDAS (De Raedt et al., 2023) (https://github.com/maarten-deraedt/IDAS-intent-discovery-with-abstract-summarization) identifies prototypes that represent the latent intents and independently summarizes them into labels using LLMs. It then encodes the concatenation of sentences and summaries for clustering. We first generate labels using GPT-3 (text-davinci-003) (Brown et al., 2020) for the five datasets used in this paper. For each test set, five JSON files are generated with different sample orders, with the number of nearest neighbors set to $topk=8$. After that, we produce the results with the generated labels and calculate the evaluation metrics.
PAS (Wang et al., 2023) (https://github.com/ZihanWangKi/GoalEx) develops a three-stage Propose-Assign-Select algorithm: it prompts LLMs to generate goal-related explanations, determines whether each explanation supports each sample, and uses integer linear programming to select clusters such that each sample belongs to a single cluster. We use the same experimental settings as Wang et al. (2023), with GPT-3.5-turbo as the proposer and google/flan-t5-xl (https://huggingface.co/google/flan-t5-xl) as the assigner. We set the cluster_num parameter to the number of labels in each dataset.
Keyphrase Clustering is the best performing clustering model proposed by Viswanathan et al. (2024), which expands the expression by generating keyphrases using LLM.
ClusterLLM (Zhang et al., 2023) (https://github.com/zhang-yu-wei/ClusterLLM) prompts an LLM for insights on similar data points and fine-tunes small embedders using the LLM's choices. It also uses GPT-3.5 to guide the clustering granularity by determining whether two data points belong to the same category. Since ClusterLLM does not report results in the ARI metric, we also reproduce its results on the five datasets. We choose the best performing variant, ClusterLLM-I-iter, for comparison. This model adopts Instructor (https://huggingface.co/hkunlp/instructor-large) as the embedder and applies the framework twice in an iterative way, using the previously fine-tuned model as initialization. The LLM used for triplet sampling and pairwise hierarchical sampling is GPT-3.5-turbo. We also re-run the framework on the ArxivS2S and GoEmo datasets to obtain the #clusters results for the granularity analysis in Section 5.2, which are not presented in the original paper. The #clusters results for Massive-I, Massive-D, and MTOP-I are taken directly from Zhang et al. (2023).
Additionally, we apply our framework with gold labels given, which performs label classification using the dataset’s ground truth cluster labels, shown as LLM_known_labels. This model represents the upper bound of the LLM’s performance.
5. Results
5.1. Text Clustering Results
We present the text clustering results in Table 2 and draw several key observations from the experimental findings.
Firstly, our proposed framework outperforms the baseline approaches across all datasets, with very few exceptions. For example, our framework achieves a significant accuracy improvement of 12.44% on the ArxivS2S dataset and even doubles the performance on MTOP-I. These results highlight the robustness and effectiveness of leveraging LLMs exclusively for text clustering tasks.
Furthermore, the observed improvements across three distinct evaluation metrics demonstrate that our framework enhances text clustering performance from multiple dimensions. This indicates that our framework not only excels at accurately identifying and differentiating distinct categories but also effectively captures the intrinsic relationships and underlying structures within the data. The consistent enhancement across metrics underscores the comprehensive impact of our framework on clustering quality.
In addition, the performance of our framework is remarkably close to the theoretical upper bound (LLM_known_labels), which uses ground truth cluster labels for classification. Achieving near-upper-bound performance without access to true labels demonstrates the capability of our framework to generate meaningful potential labels and refine cluster granularity through effective label merging. This underscores the ability of our framework to balance interpretability and accuracy, providing a practical and scalable alternative to traditional clustering techniques.
Overall, these results validate the strength of our framework in addressing the challenges of unsupervised text clustering, emphasizing its potential for broader applications and its reliability across diverse datasets. Our findings also highlight the significant role that LLMs can play in simplifying and improving text clustering by fully utilizing their in-context learning and generalization capabilities.
5.2. Granularity Analysis
To assess the granularity of the output clusters, we compare the final cluster number generated by our framework with those produced by ClusterLLM.
Table 3. Granularity analysis. The results are presented in the format of "#clusters", where a positive difference means the model generates more clusters than the ground truth, and vice versa.
We first compare our framework's cluster granularity with that produced by ClusterLLM to justify the effectiveness of the label merging task in our framework. The smaller absolute differences in Table 3 demonstrate that our framework yields cluster counts more closely aligned with the true number of clusters. This improved alignment with the real cluster distribution underscores the effectiveness of our framework in capturing the underlying data structure by merging labels with similar semantic meanings, which in turn enhances cluster coherence and validity.
Figure 2. Label merging granularity on five datasets. "GT #Clusters" means the ground truth number of clusters in the dataset.
To demonstrate how our proposed framework handles ambiguous or overlapping categories during the clustering process, we also conduct a comparative analysis of granularity before and after the merging task. Figure 2 shows that merging similar labels helps the model aggregate labels with the same meaning, resulting in a cluster number closer to the ground truth. This merging step is especially effective when the number of labels is large; for example, it aggregates 21 similar labels in the ArxivS2S dataset. Since the number of clusters can heavily impact the final clustering result, this method of improving the granularity is necessary.
This closer alignment with the actual cluster distribution highlights our framework's ability to capture the underlying structure of the data more accurately by merging labels with similar semantic meanings, leading to improved cluster coherence and validity. The ablation test regarding the label merging task in Figure 2 supports this conclusion: comparing cluster granularity before and after the merging task shows that label merging helps the model aggregate similar labels and output a cluster number closer to the ground truth.
5.3. Prompt Variation Analysis
Table 4. Prompt Variation Results. “Prompt1” and “Prompt2” refer to different prompt variations of the original prompt used in our framework.
Prompt quality plays a critical role in guiding LLMs to perform effectively in downstream tasks. To assess the impact of prompt phrasing, we conduct experiments using various formulations for prompts during both the label generation and text classification stages. The results of these experiments, summarized in Table 4, reveal that while changes in prompt wording lead to minor performance fluctuations, these variations are not significant enough to alter the overall outcomes. Importantly, the performance of our proposed framework remains robust across different prompt expressions, consistently outperforming the current state-of-the-art model, ClusterLLM, in most scenarios. This consistent superiority highlights not only the effectiveness of our framework but also its adaptability to varying prompt structures. The ability to maintain strong performance despite changes in prompts underscores the generalizability of our framework, making it a reliable approach for diverse text clustering tasks. These findings reinforce the importance of prompt engineering while also demonstrating that our framework reduces dependence on precise prompt tuning, a common challenge in deploying LLMs for real-world applications.
5.4. Few-shot Label Generation
We provide the LLM with few-shot examples in the label generation task to better exploit its in-context learning capability. By observing a small number of gold labels, the LLM can infer the underlying label semantics and generate more coherent and meaningful candidate labels. To quantify this effect, we conduct experiments with varying percentages of gold labels, as illustrated in Figure 3. The 100% case represents a theoretical upper bound (“LLM_known_labels” in Section 4.4), where all true labels are given, and the LLM performs direct classification. We find that even a small number of examples (e.g., 10–15%) consistently boosts clustering performance across all metrics and datasets. This result highlights the importance of carefully designed few-shot prompts and validates our framework’s strategy of leveraging example labels in the label generation step (Section 3.2.1). Importantly, even under the 0% setting where the LLM receives no few-shot examples, it still outperforms baseline methods on most datasets. This fully unsupervised scenario highlights the strong intrinsic capability of LLMs to generate meaningful candidate labels and perform classification without external supervision.
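The few-shot setup can be illustrated with a small prompt builder that mixes a fraction of the gold labels into the label-generation prompt. The function name and wording are assumptions for illustration, not the paper’s exact template (Table 8):

```python
import random

def build_label_generation_prompt(batch, gold_labels, pct_given=0.2, seed=0):
    """Compose a label-generation prompt, optionally including a sampled
    fraction of gold labels as few-shot hints."""
    rng = random.Random(seed)
    k = round(len(gold_labels) * pct_given)
    hints = rng.sample(gold_labels, k) if k else []
    lines = ["Suggest a short category label for each text below."]
    if hints:
        lines.append("Example labels from this dataset: " + ", ".join(hints))
    lines += [f"{i + 1}. {text}" for i, text in enumerate(batch)]
    return "\n".join(lines)

prompt = build_label_generation_prompt(
    ["Book me a flight to Paris", "Will it rain tomorrow?"],
    gold_labels=["flight booking", "weather query", "alarm", "music", "news"],
    pct_given=0.2,  # 0.0 reproduces the fully unsupervised setting
)
```

Setting `pct_given=0.0` drops the hint line entirely, matching the 0% setting in Figure 3.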
Figure 3. ACC, NMI, ARI of our framework on five datasets with different percentages of given labels. 0% means no label is provided to the LLM, 20% means we give 20% of the total gold labels to the LLM during label generation, and 100% means the LLM is provided with all true labels and directly performs classification.
5.5. Hyperparameter Sensitivity Analysis
We conduct extensive experiments to evaluate the influence of key hyperparameters on our framework’s performance, with a particular focus on the batch size $B$ and the percentage of provided labels used for in-context learning. In addition to the default batch size of 15, we investigate the impact of smaller and larger batches by testing values of 10 and 20. As reported in Table 5, the results demonstrate that varying the batch size does not significantly affect the overall performance trend of our framework. This observation indicates that our approach is relatively robust to changes in batch-level granularity, suggesting that the LLM’s reasoning ability is not heavily dependent on the exact number of samples seen in each batch.
One possible reason for this robustness is that the core task — generating and merging meaningful labels — is driven by semantic understanding rather than batch statistics. Since each batch is processed independently by the LLM with self-contained prompts, altering $B$ mainly influences computational efficiency rather than the quality of label generation or classification. Larger batch sizes (e.g., $B=20$) slightly increase the context size but do not introduce additional semantic information, while smaller batches (e.g., $B=10$) simply reduce the number of examples the model observes in one pass without degrading performance. Consequently, the choice of $B$ can be guided more by resource and latency considerations than by accuracy concerns, making our framework more flexible in real-world applications.
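Because each batch is handled by an independent, self-contained prompt, batching reduces to simple slicing over the corpus. The sketch below (with a stub labeller in place of a real LLM call) shows how $B$ only changes the number of calls, not the content seen overall:

```python
def cluster_in_batches(texts, generate_labels, batch_size=15):
    """Process the corpus in independent batches; the batch size mainly
    trades per-call context length against the number of LLM calls."""
    all_labels = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        all_labels.extend(generate_labels(batch))  # one LLM call per batch
    return all_labels

# With a stub labeller, 33 texts at batch size 15 yield 3 calls.
calls = []
def stub(batch):
    calls.append(len(batch))
    return ["label"] * len(batch)

labels = cluster_in_batches([f"t{i}" for i in range(33)], stub, batch_size=15)
print(calls)  # [15, 15, 3]
```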
Moreover, we explore the effect of varying the proportion of provided labels on performance by testing three different settings: 10%, 15%, and 25%. Table 6 shows that our framework benefits notably from the presence of a moderate number of given labels, leveraging the in-context learning capability of LLMs. When the number of provided label examples is small (e.g., 10%), the performance improves steadily as more few-shot examples are included (e.g., 15%). However, when an excessive number of examples are introduced (25%), the model’s performance slightly deteriorates. We hypothesize that this decline is due to prompt overload, where too many examples cause the model to lose focus on the underlying clustering logic or overfit to the examples rather than generalizing across the dataset. These findings emphasize the importance of carefully balancing the number of in-context examples to achieve optimal performance.
Table 5. Experiments on different batch sizes $B$. We use a batch size of 15 in our presented method.
Table 6. Experiments on different given-label percentages. We use 20% in our presented method.
5.6. Cost Comparison
Table 7. Cost/time comparison between the API-based method (API) and the fine-tuning-based approach (FT). $N$ represents the size of the evaluation data. All times are reported in minutes.
We report a comparison of the monetary cost and wall-clock time between our proposed API-based method and a fine-tuning-based approach in Table 7.
Let $N$ denote the size of the evaluation dataset, and let $D$ be the size of the training set with sequence length $L$, trained for $E$ epochs. The input and output token lengths for the LLM are represented by $T_{\text{in}}$ and $T_{\text{out}}$, respectively. Since API calls can be executed in parallel, we assume a parallel throughput of $R$ requests per second.
Throughput estimation.
We estimate throughput for fine-tuning and inference based on the effective processing rate of transformer encoders on 4× A100 GPUs (80GB). The training and inference throughputs are

$$\tau_{\text{train}}=\frac{B\cdot G}{t_{\text{iter}}},\qquad \tau_{\text{inf}}=\frac{B_{\text{inf}}\cdot G}{t_{\text{iter}}^{\text{inf}}}, \tag{5}$$
where $B$ is the per-GPU training batch size, $B_{\text{inf}}$ is the per-GPU inference batch size, $G$ is the number of GPUs, $t_{\text{iter}}$ is the iteration time during training, and $t_{\text{iter}}^{\text{inf}}$ is the iteration time during forward-only inference. In practice, with $B=64$, $G=4$, and $t_{\text{iter}}\approx 0.85$ s for sequence length $L=512$, we obtain $\tau_{\text{train}}\approx 64\times 4/0.85\approx 300$ samples/s. For inference, with $B_{\text{inf}}=128$ and $t_{\text{iter}}^{\text{inf}}\approx 0.5$ s, we obtain $\tau_{\text{inf}}\approx 128\times 4/0.5\approx 1000$ requests/s. These values are consistent with reported throughput benchmarks for BERT-style encoders on A800 GPUs and represent conservative but realistic estimates that balance compute, memory, and data loader efficiency.
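As a quick arithmetic check, the two throughput formulas can be evaluated directly with the numbers stated in the text:

```python
# Throughput estimates from the stated hardware assumptions.
B, G, t_iter = 64, 4, 0.85        # training: per-GPU batch, GPUs, sec/iter
B_inf, t_iter_inf = 128, 0.5      # inference: per-GPU batch, sec/iter

tau_train = B * G / t_iter        # ~301 samples/s (rounded to ~300 above)
tau_inf = B_inf * G / t_iter_inf  # 1024 requests/s (rounded to ~1000 above)
print(round(tau_train), round(tau_inf))  # 301 1024
```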
API-based method.
The API cost and time are given by
$$\text{Cost}_{\text{API}}=\frac{N}{10^{6}}\Bigl(T_{\text{in}}\cdot 0.50+T_{\text{out}}\cdot 1.50\Bigr),\qquad t_{\text{API}}=\frac{N}{R}. \tag{6}$$
Note that the batch size does not affect the total token usage, and therefore does not appear in the formula.
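The API cost formula is straightforward to evaluate; the sketch below uses the stated per-million-token prices of $0.50 (input) and $1.50 (output), with illustrative sample counts and token lengths:

```python
def api_cost_and_time(n, t_in, t_out, rate, in_price=0.50, out_price=1.50):
    """Dollar cost (per-million-token prices) and time in seconds for
    n samples processed at `rate` parallel requests per second."""
    cost = n / 1e6 * (t_in * in_price + t_out * out_price)
    return cost, n / rate

# e.g. 10,000 samples, 300 input / 50 output tokens each, 20 req/s
cost, t = api_cost_and_time(10_000, 300, 50, 20)
print(cost, t)  # 2.25 500.0
```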
Fine-tuning-based method.
The training time and cost are
$$t_{\text{train}}=\frac{D\cdot E}{\tau_{\text{train}}},\qquad \text{Cost}_{\text{train}}=\frac{t_{\text{train}}}{3600}\cdot 4\cdot C_{\text{GPU}}, \tag{7}$$
where $C_{\text{GPU}}$ is the hourly rental price per GPU. Inference requires
$$t_{\text{inf}}=\frac{N}{\tau_{\text{inf}}},\qquad \text{Cost}_{\text{inf}}=\frac{t_{\text{inf}}}{3600}\cdot 4\cdot C_{\text{GPU}}. \tag{8}$$
The total cost and time for the fine-tuned model are therefore
$$\text{Cost}_{\text{FT}}=\text{Cost}_{\text{train}}+\text{Cost}_{\text{inf}},\qquad t_{\text{FT}}=t_{\text{train}}+t_{\text{inf}}. \tag{9}$$
From the results, we observe that our API-based method scales linearly with the evaluation size $N$ and requires no setup cost, making it attractive for quick, small-scale experiments. In contrast, the fine-tuning approach incurs a one-time training overhead proportional to the size of the training data, but after training, its inference cost grows linearly with $N$ at a much smaller constant factor due to high GPU throughput. Consequently, for small $N$, the API-based method is cheaper and faster to deploy. At moderate $N$, the API-based approach is faster in wall-clock time but incurs a higher cost than fine-tuning. However, as $N$ increases further, the fine-tuned approach becomes more cost-effective, since the upfront training expense is amortized and inference can be carried out at low marginal cost.
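For concreteness, the fine-tuning cost model can be evaluated as a small function; the GPU hourly price and dataset sizes below are illustrative assumptions, not values from the paper:

```python
def ft_cost_and_time(n, d, epochs, tau_train=300, tau_inf=1000,
                     gpus=4, gpu_price=2.0):
    """Fine-tuning cost model: one-time training on d samples for
    `epochs` epochs, plus inference over n samples, on `gpus` GPUs
    rented at `gpu_price` dollars/hour each. Returns (dollars, seconds)."""
    t_train = d * epochs / tau_train  # training wall-clock (s)
    t_inf = n / tau_inf               # inference wall-clock (s)
    cost = (t_train + t_inf) / 3600 * gpus * gpu_price
    return cost, t_train + t_inf

# One-time training on 50k samples for 3 epochs, then evaluating 10k samples.
cost, t = ft_cost_and_time(10_000, 50_000, 3)
```

Sweeping `n` in both cost functions locates the break-even point where the amortized fine-tuning cost drops below the linearly growing API cost.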
Conclusion
We propose a novel approach to text clustering that leverages LLMs exclusively, eliminating the need for additional embedding models or traditional clustering algorithms. Our two-stage framework reframes the text clustering problem as a combination of label generation and classification tasks, enhancing both the performance and interpretability of clustering results by assigning meaningful, human-readable labels to the clusters. In the first stage, we prompt LLMs with batches of input sentences to generate explainable and contextually relevant potential labels. To ensure consistency and clarity, similar labels are merged into a unified set of candidate cluster labels. In the second stage, we classify each input sentence into one of these refined labels using the LLM, effectively completing the clustering process. This framework capitalizes on the advanced natural language understanding, generation, and classification capabilities of LLMs without requiring any additional fine-tuning or task-specific training. The comprehensive knowledge encoded in LLMs from pre-training on diverse and extensive datasets significantly enhances our framework’s domain adaptability, making it well-suited for clustering tasks across various applications and industries.
Extensive experimental results demonstrate the effectiveness of our framework, showcasing superior clustering performance and improved granularity compared to state-of-the-art methods. Furthermore, this framework’s simplicity and adaptability make it a promising solution for applications requiring scalable and interpretable text clustering. In the future, we aim to explore more cost-efficient strategies and finer-grained clustering methods, leveraging the evolving capabilities of LLMs to further enhance performance and reduce resource consumption.
Table 8. Prompt template and instructions used in this paper. In this template, words inside {} should be replaced by corresponding variables during experiments.
Limitation
Our work has limitations in the following respects. First, since our work relies exclusively on LLMs for text clustering and does not fine-tune smaller embedders for better representation, more processing is required through LLMs. This results in increased API usage and higher associated costs. Since we use the LLM for given-label classification, the number of API calls is proportional to the dataset size. While the savings in computational costs can offset a significant portion of this API cost increase, this remains a cost limitation when dealing with large datasets. Second, while our framework achieves better granularity in clustering results compared to other LLM-based methods such as ClusterLLM, it still lacks fine-grained control. Third, during the label generation process, without explicit guidelines or standardization protocols, LLMs might produce labels that vary widely in phrasing and granularity. To manage this, we apply a merging process using the LLM to control the granularity of the generated labels. However, LLMs might not consistently merge synonymous labels or accurately distinguish between polysemous words, leading to fragmented clusters. Additionally, labels or words with multiple meanings could result in ambiguous labeling. This issue can be mitigated by adding explicit explanations for the generated labels.
Future Work
In future work, we aim to enhance our framework by incorporating user feedback to improve label accuracy and granularity, leveraging human expertise and knowledge. By allowing users to provide feedback on generated labels, the LLM can refine label quality and better manage granularity. Since our framework is built on LLMs, this interaction can be efficiently facilitated through text, making it accessible to users without algorithmic expertise. This approach is particularly advantageous in real-world scenarios, where integrating expert knowledge can be highly beneficial. Additionally, improving the stability and consistency of responses to different prompts remains a broader challenge in LLM-based systems. We recognize the significance of this issue and will continue to explore strategies to further enhance prompt stability and consistency within our framework.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (72204087), the Shanghai Planning Office of Philosophy and Social Science Youth Project (2022ETQ001), the Chenguang Program of Shanghai Education Development Foundation and Shanghai Municipal Education Commission (23CGA28), the Shanghai Pujiang Program (23PJC030), Young Elite Scientists Sponsorship Program by CAST (YESS20240562), and the Fundamental Research Funds for the Central Universities, China. We also appreciate the constructive comments from the anonymous reviewers.
Appendix
Appendix A Prompt template
We design different prompt templates ($\mathcal{P}_g$, $\mathcal{P}_m$, $\mathcal{P}_a$) and instructions ($\mathcal{I}_{\text{generate}}$, $\mathcal{I}_{\text{merge}}$, $\mathcal{I}_{\text{assign}}$) for the three sub-tasks in our framework: label generation, label aggregation & merging, and given-label classification. Each template is carefully constructed to guide the LLMs toward producing high-quality, task-specific outputs with minimal ambiguity. Table 8 provides an overview of the prompt templates and corresponding instructions for each task, illustrating how we tailor the query to maximize the LLM’s performance.
To improve the reliability and consistency of responses, we integrate format-control instructions into the prompts. For example, we explicitly include directives such as “Please return the output in JSON format” and provide a concrete JSON structure example. This approach ensures that the LLM outputs are not only correct but also well-structured for downstream processing.
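Since downstream processing depends on well-formed JSON, a defensive parser is useful in practice. The helper below is an illustrative sketch (not part of the paper’s released code) that tolerates surrounding prose or markdown code fences, a common LLM failure mode even with format-control instructions:

```python
import json

def parse_llm_json(response):
    """Extract the first-to-last JSON object span from an LLM reply,
    tolerating surrounding prose or markdown code fences."""
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    return json.loads(response[start:end + 1])

reply = 'Sure! Here are the labels:\n```json\n{"1": "weather query"}\n```'
print(parse_llm_json(reply))  # {'1': 'weather query'}
```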
References
- Agarwal et al. (2020) Neha Agarwal, Geeta Sikka, and Lalit Kumar Awasthi. 2020. Enhancing web service clustering using Length Feature Weight Method for service description document vector space representation. Expert Systems with Applications 161 (2020), 113682.
- Aggarwal and Zhai (2012) Charu C Aggarwal and ChengXiang Zhai. 2012. A survey of text clustering algorithms. Mining text data (2012), 77–128.
- Anick and Vaithyanathan (1997) Peter G Anick and Shivakumar Vaithyanathan. 1997. Exploiting clustering and phrases for context-based information retrieval. In Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval. 314–323.
- Beil et al. (2002) Florian Beil, Martin Ester, and Xiaowei Xu. 2002. Frequent term-based text clustering. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 436–442.
- Bo et al. (2020) Deyu Bo, Xiao Wang, Chuan Shi, Meiqi Zhu, Emiao Lu, and Peng Cui. 2020. Structural deep clustering network. In Proceedings of the web conference 2020. 1400–1410.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
- Carmel et al. (2009) David Carmel, Haggai Roitman, and Naama Zwerdling. 2009. Enhancing cluster labeling using wikipedia. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. 139–146.
- Caron et al. (2018) Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. 2018. Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision (ECCV). 132–149.
- Castellanos et al. (2017) Ángel Castellanos, Juan Cigarrán, and Ana García-Serrano. 2017. Formal concept analysis for topic detection: a clustering quality experimental analysis. Information Systems 66 (2017), 24–42.
- Chang et al. (2009) Jonathan Chang, Sean Gerrish, Chong Wang, Jordan Boyd-Graber, and David Blei. 2009. Reading tea leaves: How humans interpret topic models. Advances in neural information processing systems 22 (2009).
- Chang et al. (2017) Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. 2017. Deep adaptive image clustering. In Proceedings of the IEEE international conference on computer vision. 5879–5887.
- Cutting et al. (1993) Douglass R Cutting, David R Karger, and Jan O Pedersen. 1993. Constant interaction-time scatter/gather browsing of very large document collections. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval. 126–134.
- Cutting et al. (2017) Douglass R Cutting, David R Karger, Jan O Pedersen, and John W Tukey. 2017. Scatter/gather: A cluster-based approach to browsing large document collections. In ACM SIGIR Forum, Vol.51. ACM New York, NY, USA, 148–159.
- De Raedt et al. (2023) Maarten De Raedt, Fréderic Godin, Thomas Demeester, Chris Develder, and Sinch Chatlayer. 2023. IDAS: Intent Discovery with Abstractive Summarization. In The 5th Workshop on NLP for Conversational AI. 71.
- Demszky et al. (2020) Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. 2020. GoEmotions: A Dataset of Fine-Grained Emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4040–4054.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- Ding et al. (2024) Sifan Ding, Min Li, Tianyi Huang, and William Zhu. 2024. Local density based on weighted K-nearest neighbors for density peaks clustering. Knowledge-Based Systems 305 (2024), 112609.
- Ester et al. (1996) Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In kdd, Vol.96. 226–231.
- Fei et al. (2025) Zexuan Fei, Haoyu Zhai, Jie Yang, Bin Wang, and Yan Ma. 2025. Discovering generalized clusters with adaptive mixture density-based clustering. Knowledge-Based Systems (2025), 113250.
- FitzGerald et al. (2023) Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, et al. 2023. MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 4277–4302.
- Guan et al. (2020) Renchu Guan, Hao Zhang, Yanchun Liang, Fausto Giunchiglia, Lan Huang, and Xiaoyue Feng. 2020. Deep feature-based text clustering and its explanation. IEEE Transactions on Knowledge and Data Engineering 34, 8 (2020), 3669–3680.
- Guo et al. (2017) Xifeng Guo, Xinwang Liu, En Zhu, and Jianping Yin. 2017. Deep clustering with convolutional autoencoders. In Neural Information Processing: 24th International Conference, ICONIP 2017, Guangzhou, China, November 14-18, 2017, Proceedings, Part II 24. Springer, 373–382.
- He and Huang (2025) Guoxiu He and Chen Huang. 2025. Few-shot medical relation extraction via prompt tuning enhanced pre-trained language model. Neurocomputing 633 (2025), 129752.
- He et al. (2025) Guoxiu He, Meicong Zhang, Tiancheng Su, Li Ma, and Xiaomin Zhu. 2025. Enhancing belief consistency of Large Language Model agents in decision-making process based on attribution theory. Expert Systems with Applications (2025), 129273.
- Huang et al. (2014) Peihao Huang, Yan Huang, Wei Wang, and Liang Wang. 2014. Deep embedding network for clustering. In 2014 22nd International conference on pattern recognition. IEEE, 1532–1537.
- Huang et al. (2024) Xin Huang, Fan Yang, Guanqiu Qi, Yuanyuan Li, Ranqiao Zhang, and Zhiqin Zhu. 2024. Deep attributed graph clustering with feature consistency contrastive and topology enhanced network. Knowledge-Based Systems 305 (2024), 112634.
- Johnson (1967) Stephen C Johnson. 1967. Hierarchical clustering schemes. Psychometrika 32, 3 (1967), 241–254.
- Kalogeropoulos et al. (2025) Nikitas-Rigas Kalogeropoulos, George Kontogiannis, and Christos Makris. 2025. Spectral clustering and query expansion using embeddings on the graph-based extension of the set-based information retrieval model. Expert Systems with Applications 263 (2025), 125771.
- Kuhn (1955) Harold W Kuhn. 1955. The Hungarian method for the assignment problem. Naval research logistics quarterly 2, 1-2 (1955), 83–97.
- Kwon et al. (2023) Sehyun Kwon, Jaeseung Park, Minkyu Kim, Jaewoong Cho, Ernest K Ryu, and Kangwook Lee. 2023. Image Clustering Conditioned on Text Criteria. arXiv preprint arXiv:2310.18297 (2023).
- Lee et al. (2024) Sangho Lee, Chihyeon Choi, and Youngdoo Son. 2024. Deep time-series clustering via latent representation alignment. Knowledge-Based Systems 303 (2024), 112434.
- Li et al. (2021) Haoran Li, Abhinav Arora, Shuohui Chen, Anchit Gupta, Sonal Gupta, and Yashar Mehdad. 2021. MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2950–2962.
- Li et al. (2024) Shu Li, Lixin Han, Yang Wang, Yonglin Pu, Jun Zhu, and Jingxian Li. 2024. Contrastive clustering based on generalized bias-variance decomposition. Knowledge-Based Systems 305 (2024), 112601.
- Liu et al. (2025) Zhe Liu, Haoye Qiu, Muhammet Deveci, Witold Pedrycz, and Patrick Siarry. 2025. Multi-view neutrosophic c-means clustering algorithms. Expert Systems with Applications 260 (2025), 125454.
- Lloyd (1982) Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE transactions on information theory 28, 2 (1982), 129–137.
- Ma and Zhuge (2024) Bing Ma and Hai Zhuge. 2024. Automatic construction of classification dimensions by clustering texts based on common words. Expert Systems with Applications 238 (2024), 122292.
- Mehta et al. (2021) Vivek Mehta, Seema Bawa, and Jasmeet Singh. 2021. Stamantic clustering: combining statistical and semantic features for clustering of large text datasets. Expert Systems with Applications 174 (2021), 114710.
- Muennighoff et al. (2023) Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2014–2037.
- Navigli and Crisafulli (2010) Roberto Navigli and Giuseppe Crisafulli. 2010. Inducing word senses to improve web search result clustering. In Proceedings of the 2010 conference on empirical methods in natural language processing. 116–126.
- OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744.
- Pang et al. (2025) Junbiao Pang, Anjing Hu, and Qingming Huang. 2025. Bundle fragments into a whole: Mining more complete clusters via submodular selection of interesting webpages for web topic detection. Expert Systems with Applications 260 (2025), 125125.
- Park et al. (2021) Sungwon Park, Sungwon Han, Sundong Kim, Danu Kim, Sungkyu Park, Seunghoon Hong, and Meeyoung Cha. 2021. Improving unsupervised image clustering with robust learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12278–12287.
- Qi et al. (2012) Guo-Jun Qi, Charu C Aggarwal, and Thomas Huang. 2012. Community detection with edge content in social media networks. In 2012 IEEE 28th International conference on data engineering. IEEE, 534–545.
- Qi et al. (2024) Tingting Qi, Xiangchu Feng, Bian Gao, and Kun Wang. 2024. An end-to-end Graph Convolutional Network for Semi-supervised Subspace Clustering via label self-expressiveness. Knowledge-Based Systems 286 (2024), 111393.
- Ren et al. (2023) Lina Ren, Yongbin Qin, Yanping Chen, Chuan Lin, and Ruizhang Huang. 2023. Deep document clustering via adaptive hybrid representation learning. Knowledge-Based Systems 281 (2023), 111058.
- Ren et al. (2020) Yazhou Ren, Ni Wang, Mingxia Li, and Zenglin Xu. 2020. Deep density-based image clustering. Knowledge-Based Systems 197 (2020), 105841.
- Ros and Riad (2024) Frédéric Ros and Rabia Riad. 2024. DLCS: A deep learning-based Clustering solution without any clustering algorithm, Utopia? Knowledge-Based Systems 296 (2024), 111834.
- Schaeffer (2007) Satu Elisa Schaeffer. 2007. Graph clustering. Computer science review 1, 1 (2007), 27–64.
- Sheng et al. (2025) Boheng Sheng, Jiacheng Yao, Meicong Zhang, and Guoxiu He. 2025. Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vienna, Austria, 31857–31876. doi:10.18653/v1/2025.acl-long.1538
- Song et al. (2025) Xin Song, Zhikai Xue, Guoxiu He, Jiawei Liu, and Wei Lu. 2025. Interweaving Memories of a Siamese Large Language Model. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol.39. 25155–25163.
- Su et al. (2022) Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. 2022. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741 (2022).
- Tao et al. (2021) Yaling Tao, Kentaro Takagi, and Kouta Nakata. 2021. Clustering-friendly representation learning via instance discrimination and feature decorrelation. arXiv preprint arXiv:2106.00131 (2021).
- Tian et al. (2014) Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. 2014. Learning deep representations for graph clustering. In Proceedings of the AAAI conference on artificial intelligence, Vol.28.
- Treeratpituk and Callan (2006) Pucktada Treeratpituk and Jamie Callan. 2006. Automatically labeling hierarchical clusters. In Proceedings of the 2006 international conference on Digital government research. 167–176.
- Viswanathan et al. (2024) Vijay Viswanathan, Kiril Gashteovski, Carolin Lawrence, Tongshuang Wu, and Graham Neubig. 2024. Large Language Models Enable Few-Shot Clustering. Transactions of the Association for Computational Linguistics 12 (2024), 321–333.
- Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022).
- Wang et al. (2024) Xiujuan Wang, Keke Wang, Kangmiao Chen, Zhengxiang Wang, and Kangfeng Zheng. 2024. Unsupervised twitter social bot detection using deep contrastive graph clustering. Knowledge-Based Systems 293 (2024), 111690.
- Wang et al. (2023) Zihan Wang, Jingbo Shang, and Ruiqi Zhong. 2023. Goal-Driven Explainable Clustering via Language Descriptions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 10626–10649. doi:10.18653/v1/2023.emnlp-main.657
- Wu et al. (2019) Jianlong Wu, Keyu Long, Fei Wang, Chen Qian, Cheng Li, Zhouchen Lin, and Hongbin Zha. 2019. Deep comprehensive correlation mining for image clustering. In Proceedings of the IEEE/CVF international conference on computer vision. 8150–8159.
- Xu et al. (2015) Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. 2015. Short text clustering via convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. 62–69.
- Yang et al. (2025) Beihua Yang, Peng Song, Yuanbo Cheng, Zhaowei Liu, and Yanwei Yu. 2025. Label completion based concept factorization for incomplete multi-view clustering. Knowledge-Based Systems 310 (2025), 112953.
- Yang et al. (2016) Jianwei Yang, Devi Parikh, and Dhruv Batra. 2016. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5147–5156.
- Yang et al. (2010) Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. 2010. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing 19, 10 (2010), 2761–2773.
- Yao et al. (2025) Jiacheng Yao, Xin Xu, and Guoxiu He. 2025. Metacognitive symbolic distillation framework for multi-choice machine reading comprehension. Knowledge-Based Systems 312 (2025), 113130.
- Yin et al. (2017) Hao Yin, Austin R Benson, Jure Leskovec, and David F Gleich. 2017. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 555–564.
- Zhang et al. (2018) Chao Zhang, Fangbo Tao, Xiusi Chen, Jiaming Shen, Meng Jiang, Brian Sadler, Michelle Vanni, and Jiawei Han. 2018. Taxogen: Unsupervised topic taxonomy construction by adaptive term embedding and clustering. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2701–2709.
- Zhang et al. (2015) Wen Zhang, Xijin Tang, and Taketoshi Yoshida. 2015. TESC: An approach to TExt classification using Semi-supervised Clustering. Knowledge-Based Systems 75 (2015), 152–160.
- Zhang et al. (2025) Yujie Zhang, Yan Jiang, Pengwei Yan, Zhuoren Jiang, Chenxi Lin, Guoxiu He, and Xiaozhong Liu. 2025. Risks to Scientific Peer Review in the Era of Technoscientific Acceleration: Evidence from AI Research. Available at SSRN 5287265 (2025).
- Zhang et al. (2023) Yuwei Zhang, Zihan Wang, and Jingbo Shang. 2023. ClusterLLM: Large Language Models as a Guide for Text Clustering. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 13903–13920.
- Zhou et al. (2022) Sheng Zhou, Hongjia Xu, Zhuonan Zheng, Jiawei Chen, Jiajun Bu, Jia Wu, Xin Wang, Wenwu Zhu, Martin Ester, et al. 2022. A comprehensive survey on deep clustering: Taxonomy, challenges, and future directions. arXiv preprint arXiv:2206.07579 (2022).
- Zhou et al. (2009) Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. 2009. Graph clustering based on structural/attribute similarities. Proceedings of the VLDB Endowment 2, 1 (2009), 718–729.