Clustering nodes in a graph is a useful general technique in data mining of large network data sets. In this context, Newman and Girvan [9] recently proposed an objective func- tion for graph clustering called the Q function which allows automatic selection of the number of clusters. Empirically, higher values of the Q function have been shown to correlate well with good graph clusterings. In this paper we show how optimizing the Q function can be reformulated as a spectral relaxation problem and propose two new spectral clustering algorithms that seek to maximize Q. Experimental results indicate that the new algorithms are e–cient and efiective at
flnding both good clusterings and the appropriate number of clusters across a variety of real-world graph data sets. In addition, the spectral algorithms are much faster for large sparse graphs, scaling roughly linearly with the number of nodes n in the graph, compared to O(n2) for previous clus- tering algorithms using the Q function. 1 Introduction Large complex graphs representing relationships among sets of entities are an increasingly common focus of scientiflc inquiry. Examples include social networks, Web graphs, telecommunication networks, semantic networks, and biological networks. One of the key questions in understanding such data is How many communities are there and what are the community memberships"? Algorithms for flnding such communities, or auto- matically grouping nodes in a graph into clusters, have been developed in a variety of difierent areas, includ- ing VLSI design, parallel computing, computer vision, social networks, and more recently in machine learn- ing. Good algorithms for graph clustering hinge on the quality of the objective function being used. A vari- ety of difierent objective functions and clustering algo- rithms have been proposed for this problem, ranging from hierarchical clustering to max- ow/min-cut meth- ods to methods based on truncating the eigenspace of a suitably-deflned matrix. In recent years, much attention has been paid to spectral clustering algorithms (e.g., [11],[12],[14]) that, explicitly or implicitly, attempt to ⁄The research in this paper was supported by the National Science Foundation under Grant IRI-9703120 as part of the Knowledge Discovery and Dissemination program. SW was also supported by a National Defense Science and Engineering Graduate Fellowship. yDepartment of Computer Science, University of California, Irvine globally optimize cost functions such as the Normalized Cut measure [12]. The majority of these approaches at- tempt to balance the size of the clusters while minimiz- ing the interaction between dissimilar nodes. However, for the types of complex heterogeneous networks that arise naturally in many domains, the bias that these ap- proaches have towards clusters of equal size can be seen as a drawback. Furthermore, many of these measures, such as Normalized Cut, can not be used directly for selecting the number of clusters, k, since they increase (or decrease) monotonically as k is varied. Recently, a new approach was developed by New- man and Girvan [9] to overcome limitations of previ- ous measures for measuring community structure. They proposed the modularity function" Q, which directly measures the quality of a particular clustering of...
Website: www-stat.stanford.edu | Filesize: -
No of Page(s): 12
Download A Spectral Clustering Approach To Finding ... - Department of Statistics.pdf
No comments:
Post a Comment