Semi-Supervised Clustering Using Genetic Algorithms Ayhan Demiriz Dept. of Decision Sciences and Engineering Systems Kristin P. Bennett∗ Dept. of Mathematical Sciences Mark J. Embrechts Dept. Decision Sciences and Engineering Systems Rensselaer Polytechnic Institute Troy, NY 12180 May 26, 1999 Abstract A semi-supervised clustering algorithm is proposed that combines the benefits of supervised and unsupervised learning methods. Data are seg- mented/clustered using an unsupervised learning technique that is biased toward producing segments or clusters as pure as possible in terms of
class distribution. These clusters can then be used to predict the class of future points. For example in database marketing, the technique can be used to identify and characterize segments of the customer population likely to respond to a promotion. One benefit of the approach is that it allows unlabeled data with no known class to be used to improve classi- fication accuracy. The objective function of an unsupervised technique, e.g. K-means clustering, is modified to minimize both the within cluster variance of the input attributes and a measure of cluster impurity based on the class labels. Minimizing the within cluster variance of the examples is a form of capacity control to prevent overfitting. For the the output labels, impurity measures from decision tree algorithms such as the Gini index can be used. A genetic algorithm optimizes the objective function to produce clusters. Non-empty clusters are labeled with the majority class. Experimental results show that using class information improves the generalization ability compared to unsupervised methods based only on the input attributes. The results also indicate that the method per- forms very well even when few training examples are available. Training using information from unlabeled data can improve classification accuracy on that data as well. ∗http://www.math.rpi.edu/ bennek 1 1 Introduction In this work, we examine an approach to solve classification problems that com- bines supervised and unsupervised learning techniques. In supervised learning, we assume we are given a set of labeled training points, and the task is to construct some function that will correctly predict the labels of future points. In unsupervised learning such as clustering, the task is to segment unlabeled training data into clusters that reflect some meaningful structure in the data. For the Semi-Supervised Learning Problem (SSLP) we assume that we are given both labeled and unlabeled points. The task is then to predict the labels of the unlabeled points using all the available labeled data, as well as unlabeled data. One would expect a better generalization abilityof the resulting classifier due to the better understanding of the input distribution resulting from using all the available data. The semi-supervised learning problem can be used to perform transductive inference [17]. In transduction, we are given a set of training data and a set of testing data, and the learning task is to predict the labels of the specific testing data only. Different testing data will produce different classifica-...
Website: reference.kfupm.edu.sa | Filesize: -
No of Page(s): 20
Download Semi-Supervised Clustering Using Genetic Algorithms.pdf
No comments:
Post a Comment