
Computación y Sistemas

On-line version ISSN 2007-9737; print version ISSN 1405-5546

Comp. y Sist. vol. 24 no. 2, Ciudad de México, Apr./Jun. 2020. Epub 04-Oct-2021

https://doi.org/10.13053/cys-24-2-3391 

Article of the thematic issue

Survey of Overlapping Clustering Algorithms

Beatriz Beltrán1  * 

Darnes Vilariño1 

1 Benemérita Universidad Autónoma de Puebla, Faculty of Computer Science, Mexico. bbeltran@cs.buap.mx, darnes@cs.buap.mx


Abstract:

This paper presents a study of the overlapping clustering algorithms developed in recent years. Researchers have approached these algorithms in different ways: in some cases they are based on widely known algorithms such as k-means, while others work with heuristics or graphs. The need for clustering algorithms with overlap arises because many current problems require the obtained groups to be non-exclusive, which motivates this analysis. The algorithms included in this analysis are: ADditive CLUStering (ADCLUS), Overlapping K-Means (OKM), Dynamic Overlapping Clustering based on Relevance (DClustR), Overlapping Clustering based on Density and Compactness (OCDC), MCLC, a tree-based incremental overlapping clustering method, INDCLUS, and Hybrid K-Means.

Keywords: Clustering algorithms; supervised classification; overlapping clustering

1 Introduction

The classification of objects or elements according to their similarities is one of the fundamental bases of learning and understanding. Classifying elements arises in human beings from childhood, for example, when sorting objects by color or shape. Cluster analysis supports the development of methods and algorithms to group and classify. The problem of clustering data is also widely studied in data mining and machine learning, with applications including summarization, learning, image segmentation, and marketing.

There are different ways to classify clustering algorithms, in particular by the type of clusters obtained. [14] proposes the following classification: disjoint, when an element belongs to exactly one cluster, for example, clustering movies by their content rating (AA, A, B, B15, C, and D); fuzzy, when an element belongs to all clusters but with a certain degree of membership, for example, clustering a range of a million colors; and finally overlapping, where an element may belong to more than one cluster, for example, people's food preferences.

Another categorization, distinguishing exclusive from non-exclusive classification, is also given in [14]. The first considers disjoint clusters, while the second allows overlaps. Within the exclusive classification, the intrinsic variant uses a proximity matrix and is also known as unsupervised learning; the extrinsic classification uses labels for the elements.

The intrinsic classification is sub-classified into hierarchical and partitional, depending on the structure imposed on the data. Hierarchical clustering can be agglomerative (merging existing clusters) or divisive (forming clusters by splitting existing ones), considering some similarity measure. Partitional clustering takes a parameter k, which indicates the number of clusters. This taxonomy is shown in Fig. 1.

Fig. 1 Classification types (after Jain & Dubes [14])

Continuing with hierarchical algorithms, different authors have developed algorithms of this kind, working on different domains [7, 17]: with coverage subgraphs to cluster documents [2], density subgraphs [3], suffix trees [22], center-based clustering [9], density [11], objective functions and dendrograms [12], using the nearest neighbor for the identification of duplicates [5] and predictions [10], among others.

Among the iterative algorithms, there are works where researchers use maximum likelihood [16], EM with Gaussian mixtures [21], hybrid algorithms combining GSA and K-means [13], etc. The techniques used are varied and yield different results; they have been applied to different types of data, such as texts, images, and discrete, numeric, and categorical data.

The present study focuses on clustering algorithms with overlap. The work is organized as follows: it starts with an explanation of the chosen algorithms, continues in the next section with an analysis of the computational behavior of the presented algorithms and the specification of the data tested with each technique, and the last section contains the conclusions.

2 Clustering Algorithms with Overlap

This section explains the solutions that different authors have proposed for the clustering problem, taking overlapping clusters into account.

2.1 ADditive CLUStering (ADCLUS)

One of the first works is [1], where a new clustering model is described. In this model, the restriction that objects be clustered into exhaustive or mutually exclusive categories is relaxed, allowing the establishment of overlapping clusters. Notably, many datasets to be grouped do not require exclusive clusters, hence the need for a solution with overlap; however, creating all possible overlapping sets gives a total of $2^n - 1$ clusters, so heuristics are needed to select potential groups.

ADCLUS is the proposed model, which clusters elements that meet some property, with a certain weight. ADCLUS considers $n$ objects to be grouped and a symmetric proximity matrix with $n(n-1)/2$ distinct entries. The data is transformed into similarities in the $[0,1]$ interval. The underlying basic equation of ADCLUS is:

$$\hat{S}_{ij} = \sum_{k=1}^{m} w_k P_{ik} P_{jk}, \tag{1}$$

where $\hat{S}_{ij}$ is the theoretically reconstructed similarity between objects $i$ and $j$, and $w_k$ is a non-negative weight. Thus, the similarity between a pair of objects is the sum of the weights of those groups that contain both objects. In addition, the MAPCLUS algorithm builds the matrix starting from a set of weights and defined subsets, outperforming ADCLUS.
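
As a minimal sketch of Eq. (1), assuming a binary membership matrix P (n objects by m clusters) and non-negative cluster weights w (names and values are illustrative, not from [1]):

```python
import numpy as np

def adclus_reconstruct(P: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Return S_hat where S_hat[i, j] = sum_k w[k] * P[i, k] * P[j, k]."""
    return (P * w) @ P.T

# Example: 4 objects, 2 overlapping clusters (object 1 belongs to both).
P = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 1]])
w = np.array([0.8, 0.5])
S_hat = adclus_reconstruct(P, w)
print(S_hat)  # e.g. objects 0 and 1 share cluster 0, so S_hat[0, 1] = 0.8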

2.2 Overlapped K-Means (OKM)

The reason for developing an overlapping algorithm based on K-Means [4] lies in various applications in information retrieval, natural language processing, chemistry, biology, and medicine, among others, where an overlapping coverage of the data is required. An objective criterion is therefore proposed, associated with the OKM algorithm, which generalizes the k-means algorithm.

The objective criterion is defined as follows: given a set of data vectors $X = \{x_i\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^n$, find a k-way coverage $\{\pi_c\}_{c=1}^{k}$, where $\pi_c$ represents the $c$-th group, such that the following objective is minimized:

$$J(\{\pi_c\}_{c=1}^{k}) = \sum_{x_i \in X} \| x_i - \phi(x_i) \|^2, \tag{2}$$

where each $x_i$ must belong to at least one group, so $\bigcup_{c=1}^{k} \pi_c = X$, and $\phi(x_i)$ denotes the image of $x_i$, defined by the combination of the prototypes $m_c$ of the groups to which $x_i$ belongs, as:

$$\phi(x_i) = \frac{\sum_{m_c \in A_i} m_c}{|A_i|}, \tag{3}$$

where $A_i$ is the assignment set of $x_i$: $A_i = \{ m_c \mid x_i \in \pi_c \}$. A heuristic was developed for obtaining the optimal coverage.
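
A hedged sketch of the OKM objective (Eqs. 2-3) follows: each point's image is the mean of the prototypes of the clusters it is assigned to. The binary assignment matrix A is an assumption of ours (any representation with at least one cluster per point would do):

```python
import numpy as np

def okm_objective(X: np.ndarray, M: np.ndarray, A: np.ndarray) -> float:
    """X: (n, d) data; M: (k, d) prototypes; A: (n, k) binary assignments."""
    counts = A.sum(axis=1, keepdims=True)  # |A_i| for each point
    phi = (A @ M) / counts                 # image of each x_i (Eq. 3)
    return float(((X - phi) ** 2).sum())   # objective J (Eq. 2)
```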

2.3 Dynamic Overlapping Algorithm based on Relevance (DClustR)

The DClustR algorithm [18] allows overlap between its groups, as an alternative for analysis in social networks, information retrieval, and bioinformatics. This algorithm is based on graph theory and introduces strategies for building more precise overlapping clusters and for updating them when the collection changes.

The main idea is to generate a set of clusters that form a coverage of $\tilde{G}_\beta$ using ws-graphs, and subsequently to improve the initial clusters to obtain the right ones, where "improve" means reducing both the number of clusters and the overlap between them.

To define the ws-graph, let $\tilde{G}_\beta = \langle V, E, S \rangle$ be a weighted thresholded similarity graph. A weighted star-shaped subgraph (ws-graph) in $\tilde{G}_\beta$, denoted by $G_c = \langle V', E', S' \rangle$, is a subgraph of $\tilde{G}_\beta$ having a vertex $c \in V'$ such that there is an edge between $c$ and every other vertex in $V'$. The vertex $c$ is called the center of the ws-graph and the remaining vertexes are called satellites. Isolated vertexes are considered degenerate ws-graphs.

Since a ws-graph is determined by its center, the problem of building the set $W = \{G_{c_1}, G_{c_2}, \ldots, G_{c_k}\}$ of ws-graphs, such that $W$ is a coverage of $\tilde{G}_\beta$, can be seen as the problem of building the set $X = \{c_1, c_2, \ldots, c_k\}$ such that each $c_i \in X$ is the center of $G_{c_i} \in W$, $i = 1, \ldots, k$.

To avoid analyzing all vertexes in $V$ and to delimit the search space, a selection criterion is established: DClustR introduces the concept of the relevance of a vertex, by which those vertexes with the highest degree can be selected; the aim is to maximize the number of vertexes added to the coverage of $\tilde{G}_\beta$ in each iteration.
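
A simplified greedy coverage sketch, using the number of newly covered vertexes as a stand-in for DClustR's relevance criterion (the paper's actual relevance measure is richer); `adj` maps each vertex to its set of neighbors in the thresholded graph:

```python
def greedy_ws_coverage(adj):
    """Pick ws-graph centers until every vertex of the graph is covered."""
    uncovered = set(adj)
    centers = []
    while uncovered:
        # prefer the vertex that adds the most uncovered vertexes
        c = max(uncovered, key=lambda v: len((adj[v] | {v}) & uncovered))
        centers.append(c)
        uncovered -= adj[c] | {c}  # the center covers itself and its satellites
    return centers
```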

2.4 Overlapping Clustering based on Density and Compactness (OCDC)

The OCDC algorithm [19] introduces a new graph coverage and a new filtering strategy, with which a small set of overlapping clusters can be obtained. The collection of objects is represented as a weighted thresholded similarity graph $\tilde{G}_\beta$, and the overlapping clustering is carried out in two phases: initialization and improvement.

In the initialization phase, an initial set of groups covering the vertexes of $\tilde{G}_\beta$ is built using ws-graphs; in this context, each of these graphs makes up an initial group. In this step the algorithm seeks to reduce the search space, and OCDC introduces the concepts of density and compactness of a vertex $v$. The density of a vertex $v \in V$ is calculated using the following equation:

$$v.density = \frac{v.pre\_dens}{|v.Adj|}, \tag{4}$$

where $v.pre\_dens$ is the number of vertexes adjacent to $v$ having a degree not greater than the degree of $v$, and $|v.Adj|$ is the total number of vertexes adjacent to $v$.

The compactness of a vertex $v \in V$ is estimated as follows:

$$v.compactness = \frac{v.pre\_compt}{|v.Adj|}, \tag{5}$$

where $v.pre\_compt$ is the number of vertexes $u \in v.Adj$ such that $Aprox\_Intra\_sim(G_v) \geq Aprox\_Intra\_sim(G_u)$, where $G_v$ and $G_u$ are the ws-graphs determined by $v$ and $u$, respectively. Vertexes with greater compactness are preferred for inclusion in the coverage, yielding a better coverage graph.
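
A minimal sketch of the vertex density of Eq. (4), under the same `adj` adjacency representation as above (an assumption of ours, not code from [19]):

```python
def vertex_density(v, adj):
    """Fraction of v's neighbors whose degree does not exceed v's own (Eq. 4)."""
    deg_v = len(adj[v])
    pre_dens = sum(1 for u in adj[v] if len(adj[u]) <= deg_v)
    return pre_dens / deg_v if deg_v else 0.0
```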

2.5 MCLC Algorithm

The MCLC algorithm is proposed to discover overlapping communities [6], using a random walk on a line graph together with an attraction intensity. Unlike the traditional random walk that starts from a node, it starts from a link. First, the network graph is transformed into a weighted line graph, and the random walk on this line graph is associated with a Markov chain. To obtain the transition probabilities of the Markov chain, a similarity between pairs of links is computed. The links can then be grouped into "link communities", whose nodes may overlap.

The link communities become "node communities", and an attraction intensity is defined to control the size of the overlap. Finally, the communities that allow overlapping are detected.

The distance or similarity between pairs of links is obtained by calculating the transition probabilities of random walks on the line graph. An $M \times M$ matrix can be associated with a Markov chain of $M$ states; the transition matrix $P = [P_{\alpha\beta}]$ is defined as:

$$P_{\alpha\beta} = \frac{h_{\alpha\beta}}{\sum_{\beta} h_{\alpha\beta}}. \tag{6}$$

Random walks of length $t$ starting from link $\alpha$ are considered: $[P^t]_{\alpha\beta}$ is the probability that a walk starting from $\alpha$ is at $\beta$ after $t$ steps, and these probabilities are accumulated as $\sum_{t=1}^{T} [P^t]_{\alpha\beta}$, with $1 \leq t \leq T$.

Cluster analysis can use pairs of links in candidate link communities, so a symmetric similarity $\phi_{\alpha\beta}$ is proposed as follows:

$$\phi_{\alpha\beta} = \phi_{\beta\alpha} = \sum_{t=1}^{T} \left( [P^t]_{\alpha\beta} + [P^t]_{\beta\alpha} \right). \tag{7}$$

The distance $d_{\alpha\beta}$ between a pair of links $(\alpha, \beta)$ is obtained as the complement of the similarity, normalizing the results between 0 and 1:

$$d_{\alpha\beta} = d_{\beta\alpha} = 1 - \frac{\phi_{\alpha\beta} - \min \phi}{\max \phi - \min \phi}. \tag{8}$$
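
A hedged sketch of Eqs. (6)-(8), assuming `h` is a precomputed M-by-M matrix of non-negative link-to-link weights (how `h` is built from the line graph is left out here):

```python
import numpy as np

def link_distances(h: np.ndarray, T: int) -> np.ndarray:
    """Row-normalize h into P (Eq. 6), accumulate powers up to T into the
    symmetric similarity phi (Eq. 7), then normalize into a distance (Eq. 8)."""
    P = h / h.sum(axis=1, keepdims=True)   # transition matrix, Eq. 6
    phi = np.zeros_like(P)
    Pt = np.eye(len(P))
    for _ in range(T):
        Pt = Pt @ P                        # P^t
        phi += Pt + Pt.T                   # Eq. 7
    return 1.0 - (phi - phi.min()) / (phi.max() - phi.min())  # Eq. 8
```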

2.6 Clustering Method with Incremental Overlapping based on Trees

The tree-based incremental overlapping clustering method [20] uses three-way decision theory. Representative points, organized in a tree, improve the relevance of the search results. Overlapping clusters are represented by three-way decisions with interval sets. Three-way decision strategies are designed to update the clusters as the data increases. Furthermore, with this method it is possible to determine the number of clusters during the process.

To define three-way decision clustering, let $U = \{x_1, \ldots, x_n, \ldots, x_N\}$ be the universe and $C = \{C_1, \ldots, C_k, \ldots, C_K\}$ the resulting family of clusters of the universe. Each object $x_n$ has $D$ attributes, $x_n = (x_{n1}, \ldots, x_{nd}, \ldots, x_{nD})$, where $x_{nd}$ represents the value of the $d$-th attribute of the object $x_n$, with $n \in \{1, \ldots, N\}$ and $d \in \{1, \ldots, D\}$.

The algorithm starts by calculating the (Euclidean) distance between objects. The similarity between objects is obtained as the complement of the distance. Subsequently, the representative points are determined using the following condition: if $|Neighbor(r)| \geq \zeta$, then $r$ is a representative point and represents the objects in the area centered at $r$ with radius $\delta$.

The next step is the construction of an undirected graph $G$ based on the $R$ representative points, using three-way decisions and the calculated representative points. Finally, the algorithm searches for the strongly connected subgraphs of $G$. This design allows the data to grow, but as it does, different situations must be simulated to evaluate the performance of the method.
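
A simplified sketch of the representative-point condition described above; the Euclidean distance is from the method, while the parameter names `delta` and `zeta` follow the text's $\delta$ and $\zeta$:

```python
import numpy as np

def representative_points(X: np.ndarray, delta: float, zeta: int):
    """Return indices i such that at least zeta other objects lie within
    radius delta of X[i], i.e. |Neighbor(r)| >= zeta."""
    reps = []
    for i, r in enumerate(X):
        neighbors = int(np.sum(np.linalg.norm(X - r, axis=1) <= delta)) - 1
        if neighbors >= zeta:
            reps.append(i)
    return reps
```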

2.7 INDCLUS

This section examines the scalability of the ADCLUS and INDCLUS models [8], techniques that can be used to extract overlapping clusters from similarity data. In that paper, the ADCLUS and INDCLUS models were taken as a starting point, and different metaheuristic extensions were designed to obtain more relaxed models.

For the INDCLUS model, $N$ elements are considered, with similarity matrices $S_k = (s_{kij})_{N \times N}$, and a grouping into a known number $M$ of possibly overlapping clusters is required. The INDCLUS model requires minimizing the optimization function:

$$\begin{aligned}
\min \quad & \sum_{k=1}^{K} \sum_{i=1}^{N} \sum_{j \neq i} \left( s_{kij} - \sum_{m=1}^{M} w_{km} P_{im} P_{jm} - c_k \right)^2 \\
\text{s.t.} \quad & w_{km} \geq 0, \quad k = 1, \ldots, K; \; m = 1, \ldots, M, \\
& c_k \geq 0, \quad k = 1, \ldots, K, \\
& P_{im} \in \{0, 1\}, \quad i = 1, \ldots, N; \; m = 1, \ldots, M,
\end{aligned} \tag{9}$$

where $K$ is the number of subjects, $N$ is the number of elements to be grouped, and $s_{kij}$ is the similarity of elements $i$ and $j$ for subject $k$. If $K = 1$, the model reduces to ADCLUS.

The heuristics used with these algorithms are: alternating least squares (SINDCLUS), a symmetric approach applied to SINDCLUS (SYMPRES), simulated annealing (SA-SINDCLUS), tabu search (TABU-SINDCLUS), and a relaxed solution space (SMC-Relax). The tests were performed with medium-size real datasets; SMC-Relax performed better than SINDCLUS and SYMPRES. The use of heuristics makes the ADCLUS and INDCLUS models scalable.
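
A minimal sketch of evaluating the INDCLUS objective (Eq. 9), assuming `S` holds the K subject similarity matrices with shape (K, N, N), `P` is a binary (N, M) membership matrix, `W` the per-subject cluster weights (K, M), and `c` the additive constants (K,); the shapes are our assumption:

```python
import numpy as np

def indclus_loss(S: np.ndarray, P: np.ndarray, W: np.ndarray, c: np.ndarray) -> float:
    """Sum of squared off-diagonal residuals over all K subjects (Eq. 9)."""
    loss = 0.0
    off_diag = ~np.eye(S.shape[1], dtype=bool)
    for k in range(S.shape[0]):
        S_hat = (P * W[k]) @ P.T + c[k]    # model for subject k
        loss += float(((S[k] - S_hat)[off_diag] ** 2).sum())
    return loss
```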

2.8 Hybrid K-Means

In [15] an algorithm (HKM-OKM) is described that combines harmonic k-means (HKM) with overlapping k-means (OKM). The OKM algorithm, being an extension of k-means, is sensitive to the initial cluster centroids, but when combined with harmonic k-means this limitation can be overcome.

The main idea of this method is to use the output of the HKM method to initialize the centroids of the OKM method (OKM was explained in Section 2.2). The HKM algorithm introduces a bias (using weights) to move the cluster centers toward the data points that are most important according to some criterion.

Similar to the k-means algorithm, the HKM method can be formulated as an optimization problem where the objective is to minimize:

$$Q(\pi) = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} \frac{1}{\| x_i - z_j \|^p}}, \tag{10}$$

where $p$ is a free parameter (typically $p \geq 2$), and the inner expression $k \big/ \sum_{j=1}^{k} \| x_i - z_j \|^{-p}$ is the harmonic mean. To minimize this objective, the algorithm needs to recompute each cluster centroid $z_j$ using:

$$z_j = \frac{\sum_{i=1}^{n} m(z_j | x_i)\, w(x_i)\, x_i}{\sum_{i=1}^{n} m(z_j | x_i)\, w(x_i)}, \tag{11}$$

where $m(z_j | x_i)$ is the membership of the data point $x_i$ in the cluster with centroid $z_j$, calculated by:

$$m(z_j | x_i) = \frac{\| x_i - z_j \|^{-p-2}}{\sum_{j'=1}^{k} \| x_i - z_{j'} \|^{-p-2}}, \tag{12}$$

and $w(x_i)$ is the weight associated with each point $x_i$, calculated by:

$$w(x_i) = \frac{\sum_{j=1}^{k} \| x_i - z_j \|^{-p-2}}{\left( \sum_{j=1}^{k} \| x_i - z_j \|^{-p} \right)^2}. \tag{13}$$

The HKM-OKM algorithm starts by finding centers using HKM, then initializes OKM with the found centers. A set of medical datasets is used, since such data require modeling elements with overlap. HKM-OKM improves the results obtained by OKM.
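
A hedged sketch of one HKM center update (Eqs. 11-13), as used to produce the centers that initialize OKM; the value of `p` and the `eps` guard against division by zero are our assumptions:

```python
import numpy as np

def hkm_step(X: np.ndarray, Z: np.ndarray, p: float = 3.5, eps: float = 1e-12):
    """One harmonic k-means update. X: (n, d) data; Z: (k, d) centers."""
    D = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2) + eps  # ||x_i - z_j||
    m = D ** (-p - 2) / (D ** (-p - 2)).sum(axis=1, keepdims=True)   # Eq. 12
    w = (D ** (-p - 2)).sum(axis=1) / ((D ** -p).sum(axis=1)) ** 2   # Eq. 13
    mw = m * w[:, None]
    Z_new = (mw.T @ X) / mw.sum(axis=0)[:, None]                     # Eq. 11
    return Z_new, m, w
```

Iterating this step to convergence and handing the resulting centers to OKM mirrors the initialization scheme described above.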

3 Algorithms Analysis

The analyzed algorithms have a maximum time complexity of quadratic order, namely OCDC, MCLC, the tree-based method, INDCLUS with its heuristics, and hybrid k-means; the particular case of OKM maintains the order of the algorithm on which it is based (k-means), and only the ADCLUS algorithm is of cubic order. This information can be reviewed in Table 1.

Table 1 Comparison between clustering algorithms with overlap

Algorithm        Data type                 Amount of data   Complexity
ADCLUS           Discrete                  105              O(n^3)
OKM              Qualitative / documents   1,308            O(tnk log k)
DClustR          Qualitative / documents   16,006           O(|V| + |E_β|)
OCDC             Documents                 16,006           O(n^2)
MCLC             Discrete                  1,133            O(m^2 n)
Tree-based       Discrete                  5,473            O(n^2 + n log n)
INDCLUS          Qualitative / documents   102,294          O(n^2)
Hybrid K-Means   Qualitative               699              O(n^2)

The number of instances used with these algorithms in the experiments varies, from a minimum of 105 to a maximum of 102,294 instances. Furthermore, the objects are of different types: discrete, qualitative, or documents. All experiments used standard datasets; for example, the UCI Machine Learning repository was used, testing the cancer, heart disease, and Parkinson datasets, among others; the Zachary karate club, KDD, and ISOLET datasets were also used, while Reuters-21578 and TDT2 were mainly used for the experiments with documents.

In general, overlapping clustering algorithms are based on other algorithms that do not support overlap, and even improve some aspects of them.

4 Conclusion and Future Work

In this article, clustering algorithms with overlap were analyzed.

Over the last years, interest in the development and improvement of this type of algorithm has been constant, and researchers continue seeking to improve the obtained results.

Different techniques have been used in this type of algorithm: from basic algorithms such as k-means, through heuristics for scalability, to graph theory; finally, combinations of algorithms have been used to counteract some of their deficiencies.

The number of elements handled by these algorithms is generally not very large; standardized datasets are used, and the quality of the algorithms is verified with standard measures such as F-measure or F-BCubed.

Acknowledgements

We would like to thank the Vice-Rectory of Research and Postgraduate Studies (VIEP) of the Benemérita Universidad Autónoma de Puebla for the financial support.

References

1. Arabie, P., Carroll, J. D., DeSarbo, W., & Wind, J. (1981). Overlapping clustering: A new method for product positioning. Journal of Marketing Research, Vol. 18, No. 3, pp. 310–317.

2. Aslam, J. A., Pelekhov, E., & Rus, D. (1999). A practical clustering algorithm for static and dynamic information organization. Proceedings of the 1999 Symposium on Discrete Algorithms, pp. 51–60.

3. Aslam, J. A., Pelekhov, E., & Rus, D. (2004). The star clustering algorithm for static and dynamic information organization. Journal of Graph Algorithms and Applications, Vol. 8, No. 1, pp. 95–129.

4. Cleuziou, G. (2008). An extended version of the k-means method for overlapping clustering. 19th International Conference on Pattern Recognition (ICPR 2008), pp. 1–4.

5. Costa, G., Manco, G., & Ortale, R. (2009). An incremental clustering scheme for data de-duplication. Data Mining and Knowledge Discovery, Vol. 20, No. 1, pp. 152.

6. Deng, X., Li, G., Dong, M., & Ota, K. (2017). Finding overlapping communities based on Markov chain and link clustering. Peer-to-Peer Networking and Applications, Vol. 10, No. 2, pp. 411–420.

7. Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, Vol. 2, No. 2, pp. 139–172.

8. France, S. L., Chen, W., & Deng, Y. (2017). ADCLUS and INDCLUS: analysis, experimentation, and meta-heuristic algorithm extensions. Advances in Data Analysis and Classification, Vol. 11, No. 2, pp. 371–393.

9. Gan, G., Ma, C., & Wu, J. (2007). Data Clustering: Theory, Algorithms, and Applications. SIAM.

10. Gan, H., Fan, Y., Luo, Z., & Zhang, Q. (2018). Local homogeneous consistent safe semi-supervised clustering. Expert Systems with Applications, Vol. 97, pp. 384–393.

11. Ghosh, J., Liu, A., & Gupta, G. (2008). Automated hierarchical density shaving: A robust automated clustering and visualization framework for large biological data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 7, pp. 223–237.

12. Gilpin, S., & Davidson, I. (2017). A flexible ILP formulation for hierarchical clustering. Artificial Intelligence, Vol. 244, pp. 95–109.

13. Hatamlou, A., Abdullah, S., & Nezamabadi-Pour, H. (2012). A combined approach for clustering based on k-means and gravitational search algorithms. Swarm and Evolutionary Computation, Vol. 6, pp. 47–52.

14. Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River, NJ, USA.

15. Khanmohammadi, S., Adibeig, N., & Shanehbandy, S. (2017). An improved overlapping k-means clustering method for medical applications. Expert Systems with Applications, Vol. 67, pp. 12–18.

16. Kneser, R., & Ney, H. (1993). Improved clustering techniques for class-based statistical language modelling. EUROSPEECH, ISCA.

17. Patnaik, A. K., Bhuyan, P. K., & Rao, K. K. (2016). Divisive analysis (DIANA) of hierarchical clustering and GPS data for level of service criteria of urban streets. Alexandria Engineering Journal, Vol. 55, No. 1, pp. 407–418.

18. Pérez-Suárez, A., Martínez-Trinidad, J. F., Carrasco-Ochoa, J. A., & Medina-Pagola, J. E. (2013). An algorithm based on density and compactness for dynamic overlapping clustering. Pattern Recognition, Vol. 46, No. 11, pp. 3040–3055.

19. Pérez-Suárez, A., Martínez-Trinidad, J. F., Carrasco-Ochoa, J. A., & Medina-Pagola, J. E. (2013). A new overlapping clustering algorithm based on graph theory. In Batyrshin, I., & González Mendoza, M. (eds.), Advances in Artificial Intelligence, Springer, Berlin, Heidelberg, pp. 61–72.

20. Yu, H., Zhang, C., & Wang, G. (2016). A tree-based incremental overlapping clustering method using the three-way decision theory. Knowledge-Based Systems, Vol. 91, pp. 189–203.

21. Yu, J., Chaomurilige, C., & Yang, M.-S. (2018). On convergence and parameter selection of the EM and DA-EM algorithms for Gaussian mixtures. Pattern Recognition, Vol. 77, pp. 188–203.

22. Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98), ACM, New York, NY, USA, pp. 46–54.

Received: October 30, 2019; Accepted: March 12, 2020

* Corresponding author: Beatriz Beltrán, e-mail: bbeltran@cs.buap.mx

This is an open-access article distributed under the terms of the Creative Commons Attribution License.