Saturday, October 20, 2012

Graph clustering with network structure indices

Graph clustering has become ubiquitous in the study of relational data sets. We examine two simple algorithms: a new graphical adaptation of the k-medoids algorithm and the Girvan-Newman method based on edge betweenness centrality. We show that they can be effective at discovering the latent groups or communities that are defined by the link structure of a graph. However, both approaches rely on prohibitively expensive computations, given the size of modern relational data sets. Network structure indices (NSIs) are a proven technique for indexing network structure and efficiently finding short paths. We show how incorporating NSIs into these graph clustering algorithms can overcome these complexity limitations. We also present promising quantitative and qualitative evaluations of the modified algorithms on synthetic and real data sets.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

1. Introduction

Clustering data is a fundamental task in machine learning. Given a set of data instances, the goal is to group them in a meaningful way, with the interpretation of the grouping dictated by the domain. In the context of relational data sets (that is, data whose instances are connected by a link structure representing domain-specific relationships or statistical dependency), the clustering task becomes a means for identifying communities within networks.

For example, in the bibliographic domain, we find networks of scientific papers. Interpreted as a graph, vertices (papers) are connected by an edge when one cites the other. Given a specific paper (or group of papers), one may try to find out more about the subject matter by poring through the works cited, and perhaps the works they cite as well. However, for a sufficiently large network, the number of papers to investigate quickly becomes overwhelming. By clustering the graph, we can identify the community of relevant works surrounding the paper in question. In the sections that follow, we discuss methods for clustering such graphs into groups that are solely determined by the network structure (e.g., co-star relations between actors or citations among scientific papers).

Some of the simplest approaches to graph clustering are also very effective. We consider two algorithms: a graphical version of the k-medoids data clustering algorithm (Kaufman & Rousseeuw, 1990) and the Girvan-Newman algorithm (2002).
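To make the first of these algorithms concrete, here is a minimal Python sketch of k-medoids on an unweighted graph, using breadth-first-search hop counts as the distance. It is a generic illustration under stated assumptions, not the paper's exact adaptation: the function names, the `init_medoids` parameter, the tie-breaking rule, and the toy graph are all hypothetical. The all-pairs distance table is the kind of expensive computation the abstract refers to; an NSI would presumably replace it with a cheaper approximate distance.

```python
from collections import deque
import random


def bfs_distances(adj, source):
    """Hop-count distance from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def graph_k_medoids(adj, k, init_medoids=None, max_iter=20, seed=0):
    """Cluster the vertices of `adj` (dict: vertex -> neighbor list).

    Returns a dict mapping each medoid to the list of its members.
    `init_medoids` (hypothetical parameter) fixes the starting medoids;
    otherwise k distinct vertices are sampled at random.
    """
    nodes = sorted(adj)
    inf = float("inf")
    # Exact all-pairs distances: the costly step on large graphs.
    dist = {u: bfs_distances(adj, u) for u in nodes}
    medoids = list(init_medoids) if init_medoids else random.Random(seed).sample(nodes, k)

    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each vertex joins its nearest medoid
        # (ties go to the medoid listed first).
        clusters = {m: [] for m in medoids}
        for u in nodes:
            nearest = min(medoids, key=lambda m: dist[m].get(u, inf))
            clusters[nearest].append(u)

        # Update step: the new medoid of a cluster is the member with
        # the smallest total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c].get(u, inf) for u in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # medoids are stable, so the assignment is too
        medoids = new_medoids
    return clusters


# Toy graph (assumed for illustration): two 4-cliques joined by edge 3-4.
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
clusters = graph_k_medoids(adj, k=2, init_medoids=[0, 7])
```

As with ordinary k-medoids, the result depends on the starting medoids, so in practice one would likely run several random restarts and keep the partition with the lowest total within-cluster distance.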
While both techniques perform well, they are computationally expensive to the point of intractability when run on even moderate-size relational data sets. Using the indexing methods described by Rattigan, Maier, and Jensen (2006), we can drastically reduce the computational complexity of these algorithms. Surprisingly, this increase in scalability does not hinder performance.

2. Graph clustering algorithms

2.1. Evaluating clustering performance

Before examining the details of the graph clustering algorithms, we introduce a framework for analyzing and evaluating clustering performance. We evaluate candidate algorithms on randomly generated uniform clustered graphs...

Website: kdl.cs.umass.edu
No. of pages: 8
