We have investigated the problem of clustering documents according to their semantics, given incomplete and incoherent hints reflecting the documents’ affinities. The problem has been rigorously defined using graph theory in set-theoretic notation. We have proved the problem to be NP-hard and proposed five heuristic algorithms that tackle it using quite different approaches: a greedy algorithm, an iterated search for maximum cliques, energy minimization inspired by molecular mechanics, a genetic algorithm, and an adaptation of the Girvan-Newman algorithm. As a side effect of the fourth heuristic, an efficient and aesthetically appealing method for visualizing the large graphs in question has been developed.

The approaches have been tested empirically on the network of links between articles from over 250 language editions of Wikipedia. A thorough analysis of the network has been performed, showing surprisingly large semantic drift patterns and an uncommon topology: a scale-free skeleton linking tight clusters. It has been demonstrated that, using a blend of the proposed approaches, it is possible to automatically detect, and to a large extent eliminate, the semantic drift in the network of links between the language editions of Wikipedia. Last but not least, an open-source implementation of the proposed algorithms has been documented.

To my wife Duygu and my son Leon

Contents

1 Introduction
1.1 Background
1.2 Notation and Definitions
1.3 Problem Statement
1.4 Key Contributions
1.5 Thesis Structure
2 Literature Review and State of the Art
2.1 Computational Problem
2.2 Models of Network Dynamics
2.3 Power-Law Distributions
...
Document: Methods of Semantic Drift Reduction in Large Similarity Networks (PDF)
Website: www.icm.edu.pl
Number of pages: 102